AST with fixed nodes instead of error nodes in ANTLR

Posted 2024-09-01 03:12:17

I have an ANTLR-generated parser for Java that uses the C target, and it works quite well. The problem is that I also want it to parse erroneous code and still produce a meaningful AST. If I feed it a minimal Java class with a single import that is missing its trailing semicolon, it produces two "Tree Error Node" objects where the "import" token and the tokens for the imported class should be.

However, since it parses the code after the error correctly and produces the correct nodes for it, it must recover from the error, either by inserting the missing semicolon or by resyncing. Is there a way to make ANTLR reflect this internally repaired input in the AST? Or can I at least retrieve the tokens or text that produced the "Tree Error Node" objects?

In the C target's antlr3commontreeadaptor.c, around line 200, the following fragment indicates that the C target only creates dummy error nodes so far:

static  pANTLR3_BASE_TREE
errorNode                               (pANTLR3_BASE_TREE_ADAPTOR adaptor,   pANTLR3_TOKEN_STREAM ctnstream, pANTLR3_COMMON_TOKEN startToken, pANTLR3_COMMON_TOKEN stopToken, pANTLR3_EXCEPTION e)
{
    // Use the supplied common tree node stream to get another tree from the factory
    // TODO: Look at creating the erronode as in Java, but this is complicated by the
    // need to track and free the memory allocated to it, so for now, we just
    // want something in the tree that isn't a NULL pointer.
    //
    return adaptor->createTypeText(adaptor, ANTLR3_TOKEN_INVALID, (pANTLR3_UINT8)"Tree Error Node");
}

Am I out of luck here? Do only the error nodes produced by the Java target allow me to retrieve the text of the erroneous input?
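
To make the request concrete, here is a minimal sketch of what a richer error node could record: the range of tokens it covers, from which the offending text can be rebuilt. It deliberately does not use the ANTLR3 C runtime. `Token`, `ErrorNode` and `error_node_text` are hypothetical stand-ins invented for illustration, and the node only borrows the token buffer, which sidesteps the memory-ownership concern mentioned in the TODO comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins, NOT the ANTLR3 C runtime types. */
typedef struct {
    int         type;  /* token type as assigned by the lexer */
    const char *text;  /* token text */
} Token;

typedef struct {
    const Token *tokens; /* borrowed token buffer, not owned by the node */
    size_t       start;  /* index of the first token covered by the error */
    size_t       stop;   /* index of the last covered token (inclusive) */
} ErrorNode;

/* Write the covered token texts, separated by single spaces, into buf.
 * The result is always NUL-terminated and truncated if buf is too small.
 * Returns the number of characters stored, excluding the terminator. */
size_t error_node_text(const ErrorNode *node, char *buf, size_t cap)
{
    size_t used = 0;

    assert(node != NULL && buf != NULL && cap > 0);
    assert(node->start <= node->stop);

    buf[0] = '\0';
    for (size_t i = node->start; i <= node->stop; ++i) {
        int n = snprintf(buf + used, cap - used, "%s%s",
                         i == node->start ? "" : " ",
                         node->tokens[i].text);
        if (n < 0) {
            break;                  /* encoding error: keep what we have */
        }
        if ((size_t)n >= cap - used) {
            used = cap - 1;         /* output was truncated by snprintf */
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```

With something along these lines, a tree walker could print the raw text behind each error node instead of the fixed "Tree Error Node" string.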

Comments (2)

病毒体 2024-09-08 03:12:17

I haven't used ANTLR much, but the typical way to handle this type of error is to add rules that match the wrong syntax, have them produce error nodes, and then fix up the parser state after the error so that parsing can continue. The fix-up is the hard part, because you don't want a single error to trigger more and more follow-on errors for each new token until the end of the input.
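
One common way to do that fix-up is panic-mode resynchronization: skip tokens until one that reliably ends a statement or starts a new construct. The sketch below is a self-contained illustration and is not how ANTLR implements recovery internally. The token kinds and the `resync` helper are assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds for a tiny Java-like language. */
typedef enum {
    TOK_IMPORT, TOK_IDENT, TOK_DOT, TOK_SEMI, TOK_CLASS, TOK_EOF
} TokType;

/* A synchronizing token either ends a statement or starts a new construct. */
static int is_sync_token(TokType t)
{
    return t == TOK_SEMI || t == TOK_IMPORT || t == TOK_CLASS || t == TOK_EOF;
}

/* pos is the index of the token at which the error was detected.
 * Returns the index at which parsing should resume. A ';' is consumed
 * because it belongs to the broken statement; a keyword that starts a
 * new construct is left in place for the next rule to match. */
size_t resync(const TokType *toks, size_t count, size_t pos)
{
    assert(toks != NULL);

    while (pos < count && !is_sync_token(toks[pos])) {
        ++pos;                      /* discard tokens of the broken statement */
    }
    if (pos < count && toks[pos] == TOK_SEMI) {
        ++pos;                      /* swallow the terminating ';' */
    }
    return pos;
}
```

For an import with a missing semicolon followed directly by `class`, the error is detected at `class`, which is itself a synchronizing token, so nothing is skipped and the class declaration parses normally. That keeps one mistake from cascading into further errors.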

暖阳 2024-09-08 03:12:17

I solved the problem by adding new alternative rules to the grammar for all the erroneous statements I want to handle.

Each Java import statement gets translated to an AST subtree with, for example, the artificial symbol IMPORT as its root. To make sure that I can distinguish ASTs built from correct code from those built from erroneous code, the rules for the erroneous statements rewrite them to an AST whose root symbol carries the prefix ERR_. In the import example, the artificial root symbol would be ERR_IMPORT.

Additional root symbols could be used to encode more detailed information about the parse error.

My parser is now as error-tolerant as I need it to be, and it's very easy to add rules for new kinds of erroneous input whenever I need to. You do have to take care not to introduce any ambiguities into your grammar, though.
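
As a rough illustration of this approach, the comment in the block below sketches what such a pair of alternatives might look like in ANTLR v3 grammar syntax, and the C code shows how a consumer could pick out the ERR_ subtrees. The grammar fragment is untested and its rule names are illustrative. Depending on the rest of the grammar it may need backtracking or a syntactic predicate. The `Node` type and both functions are hypothetical and do not come from the ANTLR3 C runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical ANTLR v3 grammar fragment (illustrative names, untested):
 *
 *   tokens { IMPORT; ERR_IMPORT; }
 *
 *   importDeclaration
 *       : 'import' qualifiedName ';' -> ^(IMPORT qualifiedName)
 *       | 'import' qualifiedName     -> ^(ERR_IMPORT qualifiedName)
 *       ;
 */

/* Simplified stand-in for an AST node, NOT the ANTLR3 C runtime tree type. */
typedef struct Node {
    const char         *name;        /* root symbol, e.g. "IMPORT" */
    const struct Node **children;    /* borrowed array of child pointers */
    size_t              child_count;
} Node;

/* A subtree stems from erroneous input if its root symbol starts with ERR_. */
static int is_error_root(const Node *n)
{
    return strncmp(n->name, "ERR_", 4) == 0;
}

/* Count every subtree in the AST whose root symbol carries the ERR_ prefix. */
size_t count_error_subtrees(const Node *n)
{
    size_t total;

    assert(n != NULL);

    total = is_error_root(n) ? 1 : 0;
    for (size_t i = 0; i < n->child_count; ++i) {
        total += count_error_subtrees(n->children[i]);
    }
    return total;
}
```

Because the error alternative still rewrites to an ordinary subtree, the imported name stays available as a child of ERR_IMPORT, which is exactly the information the dummy "Tree Error Node" loses.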
