Handling token ambiguity in JavaCC

Posted 2024-07-23 11:48:36

I'm attempting to write a parser in JavaCC that can recognize a language that has some ambiguity at the token level. In this particular case the language supports the "/" token by itself as a division operator while it also supports regular expression literals.

Consider the following JavaCC grammar:

TOKEN : 
{
    ...
    < VAR : "var" > |
    < DIV : "/" > |
    < EQUALS : "=" > |
    < SEMICOLON : ";" > |
    ...
}

TOKEN :
{
    < IDENTIFIER : <IDENTIFIER_START> (<IDENTIFIER_START> | <IDENTIFIER_CHAR>)* > |
    < #IDENTIFIER_START : ( [ "$","_","A"-"Z","a"-"z" ] )> |
    < #IDENTIFIER_CHAR : ( [ "$","_","A"-"Z","a"-"z","0"-"9" ] ) >  |

    < REGEX_LITERAL : ("/" <REGEX_BODY> "/" ( <REGEX_FLAGS> )? ) > |
    < #REGEX_BODY : ( <REGEX_FIRST_CHAR> <REGEX_CHARS> ) > |
    < #REGEX_CHARS : ( <REGEX_CHAR> )* > |
    < #REGEX_FIRST_CHAR : ( ~["\r", "\n", "*", "/", "\\"] | <BACKSLASH_SEQUENCE> ) > |
    < #REGEX_CHAR : ( ~[ "\r", "\n", "/", "\\" ] | <BACKSLASH_SEQUENCE> ) > |
    < #BACKSLASH_SEQUENCE : ("\\" ~[ "\r", "\n"] ) > |
    < #REGEX_FLAGS : ( <IDENTIFIER_CHAR> )* >

}

Given the following code:

var y = a/b/c;

Two different sets of tokens could be generated. The token stream should be either:

<VAR> <IDENTIFIER> <EQUALS> <IDENTIFIER> <DIV> <IDENTIFIER> <DIV> <IDENTIFIER> <SEMICOLON>

or

<VAR> <IDENTIFIER> <EQUALS> <IDENTIFIER> <REGEX_LITERAL> <SEMICOLON>

How can I ensure that the TokenManager generates the token stream that I expect for this case?

Comments (3)

九厘米的零° 2024-07-30 11:48:36

JavaCC always consumes the longest token that matches, and there is no way to configure it otherwise. The only way around this is to add a lexical state, say IGNORE_REGEX, in which the ambiguous token (here <REGEX_LITERAL>) is not active. Then, whenever a token is recognized that cannot be followed by a <REGEX_LITERAL>, the lexical state must be switched to IGNORE_REGEX.

With the input:

var y = a/b/c

The following would occur:

  1. <VAR> is consumed, lexical state is set to DEFAULT
  2. <IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
  3. <EQUALS> is consumed, lexical state is set to DEFAULT
  4. <IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX

    At this point the input is ambiguous: either a <DIV> or a <REGEX_LITERAL> could be consumed. Since the lexical state is IGNORE_REGEX and <REGEX_LITERAL> is not active in that state, a <DIV> is consumed.

  5. <DIV> is consumed, lexical state is set to DEFAULT

  6. <IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
  7. <DIV> is consumed, lexical state is set to DEFAULT
  8. <IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
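
The walk-through above can be mimicked outside JavaCC with a small hand-written tokenizer. The following is only a sketch in plain Java, not JavaCC-generated code: the class name LexicalStateDemo, the useStates flag, and the java.util.regex patterns standing in for the grammar's token definitions are all assumptions made for illustration. It also assumes that <REGEX_LITERAL> itself switches to IGNORE_REGEX, which the answer does not state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a token manager with two lexical states.
class LexicalStateDemo {
    enum State { DEFAULT, IGNORE_REGEX }

    // Each token kind carries its pattern and the state entered after it.
    enum Kind {
        VAR("var", State.DEFAULT),
        DIV("/", State.DEFAULT),
        EQUALS("=", State.DEFAULT),
        SEMICOLON(";", State.DEFAULT),
        IDENTIFIER("[$_A-Za-z][$_A-Za-z0-9]*", State.IGNORE_REGEX),
        // Assumption: a regex literal cannot be followed by another one.
        REGEX_LITERAL("/(?:[^\\r\\n*/\\\\]|\\\\[^\\r\\n])"
                + "(?:[^\\r\\n/\\\\]|\\\\[^\\r\\n])*/[$_A-Za-z0-9]*",
                State.IGNORE_REGEX);

        final Pattern pattern;
        final State next;

        Kind(String regex, State next) {
            this.pattern = Pattern.compile(regex);
            this.next = next;
        }
    }

    // Longest match wins; on a tie the kind declared first wins.
    // With useStates == false, REGEX_LITERAL is always a candidate.
    static List<Kind> tokenize(String input, boolean useStates) {
        List<Kind> out = new ArrayList<>();
        State state = State.DEFAULT;
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++; // whitespace is skipped and does not change the state
                continue;
            }
            Kind best = null;
            int bestEnd = pos;
            for (Kind k : Kind.values()) {
                boolean excluded = useStates
                        && k == Kind.REGEX_LITERAL
                        && state == State.IGNORE_REGEX;
                if (excluded) {
                    continue;
                }
                Matcher m = k.pattern.matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    best = k;
                    bestEnd = m.end();
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("No token matches at offset " + pos);
            }
            out.add(best);
            state = best.next;
            pos = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        // State switching: "/" after an identifier is a division operator.
        System.out.println(tokenize("var y = a/b/c;", true));
        // No state switching: "/b/c" is swallowed as one regex literal.
        System.out.println(tokenize("var y = a/b/c;", false));
        // A regex literal is still recognized where one can legally start.
        System.out.println(tokenize("var r = /ab+/g;", true));
    }
}
```

In a real grammar the same effect would come from declaring the lexical states on the TOKEN productions rather than from a flag like this; the sketch only shows why tracking the previous token is enough to resolve the "/" ambiguity for this input.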

伤感在游骋 2024-07-30 11:48:36

As far as I remember (I worked with JavaCC some time back), the order in which you write each rule is the order in which it is parsed, so write your rules in an order that always generates the expression you want.

ぃ弥猫深巷。 2024-07-30 11:48:36

Since JavaScript/EcmaScript does the same thing (that is, it has regex literals and a division operator that look just like those in your example), you might want to look for an existing JavaCC grammar to learn from. I found one linked from this blog entry; there may be others.
