Creating a parser grammar from sample data

Posted on 2024-12-11 09:10:16

I have been looking at what is available to help users create grammars. There are various IDEs, but they appear to be text editors that work on the grammar file itself. I'm looking for something that takes a data-centric approach. Say I have plenty of examples of the data I want to parse. I want to work through that sample data and define the grammar directly from it.

Is there any existing software that does something like that?

To be clearer:

By a data-centric approach I mean one where the user loads a data sample, then selects pieces of it to mark them as fields, or selects items and marks them as delimiters, and so on.

This is in contrast to most of the IDEs I see, which are just text editors for writing in the grammar language itself.
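To make the idea concrete, here is a minimal sketch of what such a tool might do behind the scenes, assuming the user's selections arrive as character offsets into one sample line. The function name `regex_from_annotations`, the span tuple layout, and the rule "a field is any run of non-delimiter characters" are all assumptions made for illustration, not features of any existing product.

```python
import re

def regex_from_annotations(sample, spans):
    """Turn one annotated sample into a regex (hypothetical helper).

    spans is a list of (start, end, kind, name) tuples, where kind is
    "field" or "delim". Text not covered by any span is kept as a literal.
    """
    spans = sorted(spans, key=lambda s: s[0])
    # Characters the user marked as delimiters; fields may not contain them.
    delim_chars = {ch for s, e, kind, _ in spans if kind == "delim"
                   for ch in sample[s:e]}
    if delim_chars:
        excluded = "".join(re.escape(ch) for ch in sorted(delim_chars))
        field_body = "[^" + excluded + "]+"
    else:
        field_body = ".+"
    parts, pos = [], 0
    for start, end, kind, name in spans:
        parts.append(re.escape(sample[pos:start]))    # unmarked literal text
        if kind == "delim":
            parts.append(re.escape(sample[start:end]))
        else:
            parts.append("(?P<" + name + ">" + field_body + ")")
        pos = end
    parts.append(re.escape(sample[pos:]))
    return "".join(parts)

# The user marks three fields and two delimiters in one sample line.
sample = "2024-12-11,alice,42"
spans = [(0, 10, "field", "date"), (10, 11, "delim", None),
         (11, 16, "field", "user"), (16, 17, "delim", None),
         (17, 19, "field", "count")]
pattern = regex_from_annotations(sample, spans)
match = re.fullmatch(pattern, "2023-01-02,bob,7")
print(match.groupdict() if match else None)
```

A regex is the weakest kind of "grammar" this could emit; the same annotations could just as well drive rules for a parser generator.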

Comments (2)

在巴黎塔顶看东京樱花 2024-12-18 09:10:16

Any finite set of strings constitutes a regular language. It is trivial to write an NFA accepting such a language. From this, you can generate a DFA using the subset construction, and minimize it using the fact that a DFA needs only one state for each equivalence class of the indistinguishability relation. So it's a completely algorithmic process, and getting a regex and/or grammar from the result is similarly straightforward.
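A rough sketch of that pipeline for a finite sample set, with one substitution stated plainly: instead of an NFA followed by the subset construction, it builds a prefix tree, which is already deterministic, and then merges indistinguishable states from the leaves upward. For an acyclic automaton like this one, that merge should give the minimal (partial) DFA. The names `build_trie`, `minimize`, and `accepts` are made up for the example.

```python
def build_trie(samples):
    """Prefix-tree acceptor. States are ints; state 0 is the start state."""
    trans = [{}]            # trans[state][char] -> next state
    accepting = set()
    for word in samples:
        state = 0
        for ch in word:
            if ch not in trans[state]:
                trans.append({})
                trans[state][ch] = len(trans) - 1
            state = trans[state][ch]
        accepting.add(state)
    return trans, accepting

def minimize(trans, accepting):
    """Merge states that share (is_accepting, outgoing edges), leaves first."""
    registry = {}           # signature -> id of the merged state
    new_id = {}             # old state -> merged state
    # A child is always created after its parent, so it has a larger id;
    # walking ids in reverse therefore visits children before parents.
    for state in range(len(trans) - 1, -1, -1):
        edges = tuple(sorted((ch, new_id[t]) for ch, t in trans[state].items()))
        signature = (state in accepting, edges)
        if signature not in registry:
            registry[signature] = len(registry)
        new_id[state] = registry[signature]
    min_trans = {sid: dict(sig[1]) for sig, sid in registry.items()}
    min_accepting = {sid for sig, sid in registry.items() if sig[0]}
    return min_trans, min_accepting, new_id[0]

def accepts(trans, accepting, start, word):
    """Run the automaton on word; a missing transition means rejection."""
    state = start
    for ch in word:
        state = trans[state].get(ch)
        if state is None:
            return False
    return state in accepting

samples = ["cat", "car", "bat", "bar"]
trie, trie_accepting = build_trie(samples)
dfa, dfa_accepting, dfa_start = minimize(trie, trie_accepting)
print(len(trie), "trie states ->", len(dfa), "minimized states")
```

The minimized automaton accepts exactly the samples and nothing else, which is the point of the next paragraph: by itself it generalizes nothing.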

That being said, if you want a grammar that generates the sample strings and, possibly, others, your problem seems ill-posed. For any finite set of strings, infinitely many grammars generate them along with other strings; there are infinitely many because the grammar is free to generate any other strings at all, so long as it covers the target dataset. Your question is essentially, "given the beginning of a sequence a1, a2, ..., an, ..., say what the next n elements are." This is impossible to do, unless you just want some answer, in which case you could always start with the DFA and suggest ways to generalize it (i.e., changes that only add accepted strings).
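As an illustration of "some answer", here is one arbitrary generalization out of the infinitely many possible: collapse each run of ASCII digits to `\d+` and each run of ASCII letters to `[A-Za-z]+`, keep everything else literal, and take the union of the resulting patterns. The helper names `generalize` and `induce_regex` and the choice of character classes are assumptions; a different choice would be equally consistent with the samples.

```python
import re
import string
from itertools import groupby

DIGIT = r"\d"
LETTER = "[A-Za-z]"

def char_class(ch):
    """Map a character to the class used to generalize it."""
    if ch in string.digits:
        return DIGIT
    if ch in string.ascii_letters:
        return LETTER
    return re.escape(ch)          # punctuation etc. stays literal

def generalize(sample):
    """Replace digit and letter runs by 'one or more of that class'."""
    parts = []
    for cls, run in groupby(sample, key=char_class):
        length = len(list(run))
        if cls in (DIGIT, LETTER):
            parts.append(cls + "+")
        else:
            parts.append(cls * length)   # repeated literal, e.g. "::"
    return "".join(parts)

def induce_regex(samples):
    """Union of the generalized patterns of all samples."""
    alternatives = sorted({generalize(s) for s in samples})
    return "(?:" + "|".join(alternatives) + ")"

samples = ["id=42", "id=7", "name=bob"]
pattern = induce_regex(samples)
print(pattern)
print(bool(re.fullmatch(pattern, "key=123")))   # an unseen string
```

Every sample still matches, so this only ever adds strings to the language. Whether the added strings are the ones you wanted is exactly the part that cannot be decided from the samples alone.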

Indeed, given e.g. a regular grammar, it's easy to introduce new strings, so the first approach may work as a starting point. Note, though, that the conversion from NFA to DFA can be wildly inefficient: in the worst case the DFA has exponentially many states relative to the NFA.

娇柔作态 2024-12-18 09:10:16

I don't think you want to limit this to FSAs, but rather to consider grammars in general (whether context-free or not). I suggest looking at http://en.wikipedia.org/wiki/Grammar_induction; it discusses some algorithms (sorry, not software).
