Tools for generating grammars from examples?
This answer shows a nice example of using a parser generator to scan text for patterns of interest. In that example, the patterns are product prices.
Does anyone know of tools that generate grammars from training examples (a document plus the information I want to extract from it)? I found a couple of papers, but no tools. I looked through the ANTLR docs a bit, but ANTLR starts from a grammar: a "recognizer" takes a grammar as input, not training examples.
Comments (2)
This is a machine learning problem. At best you can get an approximation. I don't think anybody has done this well, let alone released a tool. (I actively track what people do to build grammars for computer languages. This idea has been proposed many times, but I have yet to see a useful implementation.)
The problem is that for any fixed set of examples, there is a huge number of possible grammars. It is easy to construct a naive one: for the fixed set of examples, simply propose a grammar that has one rule to recognize each example. That works, but is hardly helpful, because it accepts nothing beyond the examples it was built from. Now the question is, how many ways can you generalize this, and which one is the best? In fact you can't know, because your next new example may be a total surprise in terms of structure. (Theory definition: a language is the set of sentences that belong to it.)
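To make the naive construction concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the grammar is just a list of token sequences under a single start symbol `S`, and the function names, the `NUMBER` category, and the sample strings are all hypothetical. It builds the one-rule-per-example grammar, then applies one arbitrary generalization (any all-digit token becomes `NUMBER`) out of the many that would fit the same examples.

```python
from typing import Dict, List

Grammar = Dict[str, List[List[str]]]


def naive_grammar(examples: List[str]) -> Grammar:
    """Build a grammar with one rule per example: S -> tok1 tok2 ..."""
    rules: List[List[str]] = []
    for example in examples:
        tokens = example.split()
        if tokens not in rules:
            rules.append(tokens)
    return {"S": rules}


def normalize(tokens: List[str]) -> List[str]:
    """One arbitrary generalization: map every all-digit token to NUMBER."""
    return ["NUMBER" if tok.isdigit() else tok for tok in tokens]


def generalize(grammar: Grammar) -> Grammar:
    """Rewrite each rule with normalize() and merge rules that become equal."""
    rules: List[List[str]] = []
    for rhs in grammar["S"]:
        new_rhs = normalize(rhs)
        if new_rhs not in rules:
            rules.append(new_rhs)
    return {"S": rules}


def accepts(grammar: Grammar, sentence: str, generalized: bool = False) -> bool:
    """A sentence is accepted only if it matches some rule token for token."""
    tokens = sentence.split()
    if generalized:
        tokens = normalize(tokens)
    return tokens in grammar["S"]


examples = ["price : $ 5", "price : $ 12"]
naive = naive_grammar(examples)
general = generalize(naive)

print(accepts(naive, "price : $ 5"))                        # seen example
print(accepts(naive, "price : $ 7"))                        # unseen, rejected
print(accepts(general, "price : $ 7", generalized=True))    # now accepted
print(accepts(general, "price : $ 7.50", generalized=True)) # still a surprise
```

The naive grammar rejects `price : $ 7` because it never saw it. The generalized one accepts it, but still rejects `price : $ 7.50`, since nothing in the two training examples hinted that prices can have a decimal part. Whether `NUMBER` should have covered that case is exactly the kind of choice the examples alone cannot settle.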
We haven't even talked about the simpler problem of learning the lexemes of the language. How would you propose to learn what the legal strings for floating-point numbers are?
One tool that does this is NLTK. I highly recommend it, and the O'Reilly book that covers it is available free online. It has tools for parsing, learning grammars, and so on. The only downside is that it is mainly a research tool rather than a production tool, so the emphasis isn't on performance.
NLTK is able to construct a grammar from labeled training samples, which is exactly what you are asking for. Have a look at the docs and the book, both of which are great. (The last time I used it, it also ran on the JVM through Jython without any issues.)
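As a rough sketch of what that can look like, assuming `nltk` is installed: `nltk.grammar.induce_pcfg` builds a probabilistic context-free grammar from the productions of already-parsed trees. The three bracketed trees below are hypothetical hand-labeled samples invented for this example, and the labels (`ITEM`, `PRICE`, and so on) are arbitrary. Note that the labels here are full parse trees, not just "document plus extracted info" pairs, so this may be more annotation than the question has in mind.

```python
from nltk.grammar import Nonterminal, induce_pcfg
from nltk.tree import Tree

# Hypothetical labeled training samples: each one is a bracketed parse tree.
labeled_samples = [
    "(S (ITEM (NN widget)) (PRICE (SYM USD) (CD 5)))",
    "(S (ITEM (NN gadget)) (PRICE (SYM USD) (CD 12)))",
    "(S (ITEM (JJ blue) (NN widget)) (PRICE (SYM USD) (CD 7)))",
]

# Collect every production (rule instance) that occurs in the trees.
productions = []
for sample in labeled_samples:
    tree = Tree.fromstring(sample)
    productions.extend(tree.productions())

# Estimate rule probabilities from the relative frequency of each production.
grammar = induce_pcfg(Nonterminal("S"), productions)

for production in grammar.productions():
    print(production)
```

The resulting grammar only contains rules that actually appeared in the trees, weighted by how often they occurred, so it is a statistical summary of the samples rather than a guess at the "true" grammar behind them.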