Java CFG parser that supports ambiguities

Posted on 2024-10-10 02:58:44

I'm looking for a CFG parser implemented in Java. I'm trying to parse a natural language, and I need all possible parse trees (ambiguity), not just one of them. I have already looked at many NLP parsers, such as the Stanford Parser, but they mostly require statistical data (a treebank, which I don't have), and adapting them to a new language is rather difficult and poorly documented.
I found some parser generators such as ANTLR and JFlex, but I'm not sure they can handle ambiguities. So which parser generator or Java library is best for me?
Thanks in advance

Comments (3)

猫烠⑼条掵仅有一顆心 2024-10-17 02:58:44

You want a parser that uses the Earley algorithm. I haven't used either of these two libraries, but PEN and PEP appear to implement this algorithm in Java.

夏花。依旧 2024-10-17 02:58:44

Another option is Bison, which implements GLR, an LR-type parsing algorithm that supports ambiguous grammars. Bison also generates Java code, in addition to C++.

为你鎻心 2024-10-17 02:58:44

Take a look at the related discussion here. In my last comment in that discussion I explain that you can make any parser generator produce all of the parse trees by cloning the parse tree derived so far before making the derivation fail.

If your grammar is:

G -> ...

You would augment it like this:

G' -> G {semantic:deal-with-complete-parse-tree} <NOT-VALID-TOKEN>.

The parsing engine will ultimately fail on all derivations, but by then your program will have done one of the following:

  • Saved clones of all the trees.
  • Dealt with the semantics of each of the trees as they were found.
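
To make the trick concrete, here is a minimal sketch in plain Java. It does not use ANTLR or JavaCC; it is a hand-written backtracking parser, and the toy grammar (S -> X S | X, X -> 'a' | 'a' 'a'), the class name AllParses, and the Cont interface are all assumptions made up for illustration. The final continuation plays the role of G' -> G {semantic} <NOT-VALID-TOKEN>: it saves each complete tree and then always reports failure, so the engine keeps backtracking into the remaining alternatives.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: enumerates every parse tree of a small ambiguous grammar.
final class AllParses {
    // Continuation: gets the position reached and the tree built so far.
    // Returning false means "this derivation failed, try the next alternative".
    interface Cont { boolean apply(int pos, String tree); }

    private final String input;
    private final List<String> trees = new ArrayList<>();

    AllParses(String input) { this.input = input; }

    // X -> 'a' | 'a' 'a'
    private boolean parseX(int pos, Cont k) {
        if (input.startsWith("a", pos) && k.apply(pos + 1, "X(a)")) return true;
        return input.startsWith("aa", pos) && k.apply(pos + 2, "X(aa)");
    }

    // S -> X S | X
    private boolean parseS(int pos, Cont k) {
        // Alternative 1: X followed by S.
        if (parseX(pos, (p1, x) ->
                parseS(p1, (p2, s) -> k.apply(p2, "S(" + x + " " + s + ")")))) {
            return true;
        }
        // Alternative 2: a single X.
        return parseX(pos, (p1, x) -> k.apply(p1, "S(" + x + ")"));
    }

    // G' -> S {save the complete tree} <NOT-VALID-TOKEN>
    List<String> parseAll() {
        trees.clear();
        parseS(0, (pos, tree) -> {
            // Semantic action: keep the tree only if the whole input was consumed.
            // Strings are immutable, so storing one is as good as cloning it.
            if (pos == input.length()) trees.add(tree);
            return false; // the "invalid token": force every derivation to fail
        });
        return trees;
    }

    public static void main(String[] args) {
        for (String t : new AllParses("aaa").parseAll()) {
            System.out.println(t);
        }
    }
}
```

For the input "aaa" this grammar has three derivations (a a a, a aa, aa a), and the sketch collects one tree for each. The same idea carries over to a generated parser only if its engine actually backtracks into untried alternatives after a failure, which you would need to check for the tool you pick.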

Both ANTLR and JavaCC did well when I was teaching. My preference was for ANTLR because of its BNF lexical analysis and its much less convoluted history, vision, and licensing.
