JavaCC: How can one exclude a string from a token? (A.k.a. understanding token ambiguity.)

Posted 2024-09-03 21:31:04

I have already had many problems understanding how ambiguous tokens can be handled elegantly (or at all) in JavaCC. Let's take this example:

I want to parse XML processing instructions.

The format is: "<?" <target> <data> "?>": target is an XML name, data can be anything except ?>, because it's the closing tag.

So, let's define this in JavaCC:
(I use lexical states, in this case DEFAULT and PROC_INST)

TOKEN : { <#NAME : (very-long-definition-from-xml-1.1-goes-here) > }
TOKEN : { <WSS : (" " | "\t")+ > }   // WSS = whitespaces
<DEFAULT> TOKEN : {<PI_START : "<?" > : PROC_INST}
<PROC_INST> TOKEN : {<PI_TARGET : <NAME> >}
<PROC_INST> TOKEN : {<PI_DATA : ~[] >}   // accept everything
<PROC_INST> TOKEN : {<PI_END : "?>" > : DEFAULT}

Now the part which recognizes processing instructions:

void PROC_INSTR() : { Token t; } {
(
    <PI_START>
    (t=<PI_TARGET>){System.out.println("target: " + t.image);}
    <WSS>
    (t=<PI_DATA>){System.out.println("data: " + t.image);}
    <PI_END>
) {}
}

Let's test it with <?mytarget here-goes-some-data?>:

The target is recognized: "target: mytarget".
But now I get my favorite JavaCC parsing error:

!!  procinstparser.ParseException: Encountered "" at line 1, column 15.
!!  Was expecting one of:
!!      

Encountered nothing? Was expecting nothing? Or what? Thank you, JavaCC!

I know that I could use JavaCC's MORE keyword, but this would give me the whole processing instruction as one token, so I'd have to parse/tokenize it further myself. Why should I do that? Am I writing a parser that does not parse?

The problem is (I guess) that since <PI_DATA> recognizes "everything", my definition is wrong. I should tell JavaCC to recognize "everything except ?>" as processing instruction data.

But how can it be done?

NOTE: I can only exclude single characters using ~["a","b","c"]; I can't exclude strings such as ~["abc"] or ~["?>"]. Another great anti-feature of JavaCC.

Thank you.


Comments (2)

攀登最高峰 2024-09-10 21:31:04

A word about the tokenizer

The tokenizer (*TokenManager) matches as many input characters as possible. PI_DATA is "~[]" (1 character), so it will match any single input character if no longer match can be found. PI_END is "?>" (2 characters), so whenever the input at the current position is ?>, PI_END is matched instead of PI_DATA. This part of your grammar is correct.
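
The longest-match rule described above can be illustrated with a small Java model. This is only a sketch of the rule for the two tokens PI_DATA and PI_END, not the code JavaCC generates; the class name LongestMatchSketch and the string format of the returned tokens are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of longest-match tokenizing with only PI_DATA and PI_END (not JavaCC's generated code). */
final class LongestMatchSketch {

    /** Splits the input into descriptions such as "PI_DATA(a)" or "PI_END(?>)". */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (input.startsWith("?>", pos)) {
                // The two-character match beats the one-character PI_DATA match.
                tokens.add("PI_END(?>)");
                pos += 2;
            } else {
                // PI_DATA is ~[], i.e. exactly one arbitrary character.
                tokens.add("PI_DATA(" + input.charAt(pos) + ")");
                pos += 1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [PI_DATA(a), PI_DATA(b), PI_END(?>)]
        System.out.println(tokenize("ab?>"));
    }
}
```

A lone "?" that is not followed by ">" still comes out as PI_DATA in this model, which matches the behaviour described above.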

An unexpected suspect

The trouble may actually come from NAME. You didn't write the actual definition of that token, so I can only make assumptions about it. If the definition of NAME is too greedy, it will match too many input characters in the state PROC_INST, and you may never encounter PI_DATA or PI_END.

Watch out for a "(...)+" that includes white space, or the evil "(~[])*" that eats everything up to EOF.

Other suspects

A potential problem I see is that PI_TARGET will probably be matched several times, though you would expect PI_DATA to be matched. Once again, I can only guess because I don't have the definition of NAME.
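
To make this concrete, here is a Java sketch of the same longest-match model with a NAME-like token added. The isNameChar rule is a hypothetical, heavily simplified stand-in for the real XML NAME definition (which the question does not show), so the real grammar may behave differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the PROC_INST state with PI_TARGET, PI_DATA and PI_END (not JavaCC's generated code). */
final class GreedyNameSketch {

    // Hypothetical simplification of the XML NAME definition.
    static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '_';
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            // Length of the longest NAME-like run starting at pos.
            int nameEnd = pos;
            while (nameEnd < input.length() && isNameChar(input.charAt(nameEnd))) {
                nameEnd++;
            }
            if (input.startsWith("?>", pos)) {
                tokens.add("PI_END(?>)");
                pos += 2;
            } else if (nameEnd > pos) {
                // A NAME run is at least as long as the one-character PI_DATA,
                // and PI_TARGET is declared first, so PI_TARGET wins.
                tokens.add("PI_TARGET(" + input.substring(pos, nameEnd) + ")");
                pos = nameEnd;
            } else {
                tokens.add("PI_DATA(" + input.charAt(pos) + ")");
                pos += 1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [PI_TARGET(mytarget), PI_DATA( ), PI_TARGET(here-goes-some-data), PI_END(?>)]
        System.out.println(tokenize("mytarget here-goes-some-data?>"));
    }
}
```

Under these assumptions the data part is reported as a second PI_TARGET rather than as PI_DATA, and the blank comes out as PI_DATA because no whitespace token exists in this state.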

Another point you might want to clarify is this: you define the WSS token, but only for the DEFAULT state, so it is not available in the state PROC_INST. Should white space be a part of PI_DATA? If not, you may want to SKIP it.

Don't abuse the tokenizer

If you find that you cannot make the tokenizer obey you, you may want to move the tricky part to the parser instead. In your case, it's probably difficult to tell PI_TARGET and PI_DATA apart (as mentioned above).

The parser can expect PI data after a PI target, while the tokenizer cannot (or can hardly) carry expectations from one token to the next.

Another advantage of the parser is that you can even write Java code that peeks at the next tokens and reacts accordingly. This should be considered a last resort, but it can be useful when you must do things such as concatenating multiple tokens up to a well-known one. This may be what you're looking for here (with PI_END as the terminator token).
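
As a rough illustration of concatenating tokens up to a terminator, here is a plain Java sketch. The Tok class is a hypothetical minimal token type invented for this example; in a real JavaCC parser you would work with the generated Token class and the parser's lookahead methods instead.

```java
import java.util.List;

/** Sketch: join token images until the PI_END terminator is reached. */
final class ConcatUntilEndSketch {

    /** Hypothetical minimal token: a kind name plus the matched text. */
    static final class Tok {
        final String kind;
        final String image;

        Tok(String kind, String image) {
            this.kind = kind;
            this.image = image;
        }
    }

    /** Joins the images of all tokens from index 'from' up to, but excluding, the first PI_END. */
    static String collectData(List<Tok> tokens, int from) {
        StringBuilder data = new StringBuilder();
        for (int i = from; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (t.kind.equals("PI_END")) {
                return data.toString();
            }
            // Any other token kind is treated as part of the data.
            data.append(t.image);
        }
        throw new IllegalStateException("missing PI_END");
    }

    public static void main(String[] args) {
        List<Tok> tokens = List.of(
                new Tok("PI_TARGET", "mytarget"),
                new Tok("PI_DATA", "h"),
                new Tok("PI_DATA", "i"),
                new Tok("PI_END", "?>"));
        // Start after the target token; prints hi
        System.out.println(collectData(tokens, 1));
    }
}
```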

Finally, a trick

Here is a trick to simplify your grammar a bit:

  1. SKIP PI_START, but change the state to PROC_INST nevertheless
  2. In PROC_INST, define PI_DATA as MORE (and rename it to PI_DATA_CHAR, or just don't name it at all)
  3. In PROC_INST, on matching "?>", remove the last two characters from the token image, issue PI_DATA and change the state to DEFAULT
  4. In your parser productions, the processing instruction then becomes simple to define, and the token image of PI_DATA is ready to use

Details about manipulating the token image in the tokenizer actions are provided in JavaCC's (sparse...) documentation. It's as easy as setting the length of a StringBuffer.
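
The effect of steps 2 and 3 can be modelled in plain Java. This is only a sketch of the idea, assuming the input starts right after "<?"; the class and method names are hypothetical, and the local variable image merely imitates the buffer that a JavaCC lexical action would manipulate.

```java
/** Sketch of the MORE trick: accumulate characters, then cut off the trailing "?>". */
final class MoreTrickSketch {

    /** Returns everything before the first "?>"; the input is assumed to start right after "<?". */
    static String readPiData(String input) {
        StringBuilder image = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            if (input.startsWith("?>", pos)) {
                // The terminating match is appended to what MORE has collected so far...
                image.append("?>");
                // ...and the action removes those last two characters again.
                image.setLength(image.length() - 2);
                return image.toString();
            }
            // MORE: keep the character and continue matching.
            image.append(input.charAt(pos));
            pos++;
        }
        throw new IllegalArgumentException("unterminated processing instruction");
    }

    public static void main(String[] args) {
        // prints here-goes-some-data
        System.out.println(readPiData("here-goes-some-data?>"));
    }
}
```

Note that in this model the text before "?>" arrives as a single image, so splitting target and data would still have to happen elsewhere.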

秋叶绚丽 2024-09-10 21:31:04

One problem with your grammar is that WSS applies only in the default state. Rewrite as

<DEFAULT, PROC_INST> TOKEN : {< WSS: (" " | "\t")+ > }

The error message is that it was expecting a WSS but found a " ".

As to excluding whole strings, there are several ways to do this outlined in the FAQ.
