Handling token ambiguity in JavaCC

Posted on 2024-07-23 11:48:36

I'm attempting to write a parser in JavaCC that can recognize a language that has some ambiguity at the token level. In this particular case the language supports the "/" token by itself as a division operator while it also supports regular expression literals.

Consider the following JavaCC grammar:

TOKEN : 
{
    ...
    < VAR : "var" > |
    < DIV : "/" > |
    < EQUALS : "=" > |
    < SEMICOLON : ";" > |
    ...
}

TOKEN :
{
    < IDENTIFIER : <IDENTIFIER_START> (<IDENTIFIER_START> | <IDENTIFIER_CHAR>)* > |
    < #IDENTIFIER_START : ( [ "$","_","A"-"Z","a"-"z" ] )> |
    < #IDENTIFIER_CHAR : ( [ "$","_","A"-"Z","a"-"z","0"-"9" ] ) >  |

    < REGEX_LITERAL : ("/" <REGEX_BODY> "/" ( <REGEX_FLAGS> )? ) > |
    < #REGEX_BODY : ( <REGEX_FIRST_CHAR> <REGEX_CHARS> ) > |
    < #REGEX_CHARS : ( <REGEX_CHAR> )* > |
    < #REGEX_FIRST_CHAR : ( ~["\r", "\n", "*", "/", "\\"] | <BACKSLASH_SEQUENCE> ) > |
    < #REGEX_CHAR : ( ~[ "\r", "\n", "/", "\\" ] | <BACKSLASH_SEQUENCE> ) > |
    < #BACKSLASH_SEQUENCE : ("\\" ~[ "\r", "\n"] ) > |
    < #REGEX_FLAGS : ( <IDENTIFIER_CHAR> )* >

}

Given the following code:

var y = a/b/c;

Two different token streams could be generated. The result is either:

<VAR> <IDENTIFIER> <EQUALS> <IDENTIFIER> <DIV> <IDENTIFIER> <DIV> <IDENTIFIER> <SEMICOLON>

or

<VAR> <IDENTIFIER> <EQUALS> <IDENTIFIER> <REGEX_LITERAL> <SEMICOLON>

How can I ensure that the TokenManager generates the token stream that I expect for this case?

九厘米的零° 2024-07-30 11:48:36

JavaCC's token manager always matches the longest possible token, and there is no way to configure it otherwise. The only way to get the other tokenization is to add a lexical state, say IGNORE_REGEX, in which the ambiguous token (here <REGEX_LITERAL>) is not defined. Then, whenever a token is recognized that cannot be followed by a <REGEX_LITERAL>, the lexical state must be switched to IGNORE_REGEX.

With the input:

var y = a/b/c

The following would occur:

<VAR> is consumed, lexical state is set to DEFAULT
<IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
<EQUALS> is consumed, lexical state is set to DEFAULT
<IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
At this point there is an ambiguity: the next token could be either a <DIV> or a <REGEX_LITERAL>. Because the lexical state is IGNORE_REGEX, and <REGEX_LITERAL> is not defined in that state, a <DIV> is consumed.
<DIV> is consumed, lexical state is set to DEFAULT
<IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
<DIV> is consumed, lexical state is set to DEFAULT
<IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
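
The state switching described above could be sketched as the grammar below. This is only a sketch under some assumptions, not a tested drop-in: the parser name RegexDemo, the SKIP rule for whitespace, the three small productions, and the choices that <SEMICOLON> switches back to DEFAULT and that <REGEX_LITERAL> switches to IGNORE_REGEX are all mine and do not come from the question. The token definitions themselves follow the question's grammar, with the elided "..." tokens left out.

```javacc
options { STATIC = false; }

PARSER_BEGIN(RegexDemo)
// RegexDemo is a hypothetical name used only for this sketch.
public class RegexDemo {
    public static void main(String[] args) throws ParseException {
        RegexDemo parser =
            new RegexDemo(new java.io.StringReader("var y = a/b/c;"));
        parser.Statement();
    }
}
PARSER_END(RegexDemo)

// Assumption: whitespace is skipped in both states. A SKIP rule without
// a ": STATE" suffix leaves the current lexical state unchanged.
<DEFAULT, IGNORE_REGEX> SKIP : { " " | "\t" | "\r" | "\n" }

// Tokens after which a regex literal may appear switch to DEFAULT.
<DEFAULT, IGNORE_REGEX> TOKEN :
{
    < VAR : "var" > : DEFAULT
  | < DIV : "/" > : DEFAULT
  | < EQUALS : "=" > : DEFAULT
  | < SEMICOLON : ";" > : DEFAULT
}

// After an identifier, "/" can only be division, so switch to IGNORE_REGEX.
<DEFAULT, IGNORE_REGEX> TOKEN :
{
    < IDENTIFIER : <IDENTIFIER_START> ( <IDENTIFIER_START> | <IDENTIFIER_CHAR> )* > : IGNORE_REGEX
  | < #IDENTIFIER_START : [ "$", "_", "A"-"Z", "a"-"z" ] >
  | < #IDENTIFIER_CHAR : [ "$", "_", "A"-"Z", "a"-"z", "0"-"9" ] >
}

// REGEX_LITERAL exists only in DEFAULT, so IGNORE_REGEX can never match it.
// Assumption: a regex literal is an operand, so a "/" after it is division.
<DEFAULT> TOKEN :
{
    < REGEX_LITERAL : "/" <REGEX_BODY> "/" <REGEX_FLAGS> > : IGNORE_REGEX
  | < #REGEX_BODY : <REGEX_FIRST_CHAR> <REGEX_CHARS> >
  | < #REGEX_CHARS : ( <REGEX_CHAR> )* >
  | < #REGEX_FIRST_CHAR : ~[ "\r", "\n", "*", "/", "\\" ] | <BACKSLASH_SEQUENCE> >
  | < #REGEX_CHAR : ~[ "\r", "\n", "/", "\\" ] | <BACKSLASH_SEQUENCE> >
  | < #BACKSLASH_SEQUENCE : "\\" ~[ "\r", "\n" ] >
  | < #REGEX_FLAGS : ( <IDENTIFIER_CHAR> )* >
}

// Minimal productions, just enough to parse the sample statement.
void Statement() :
{}
{
    <VAR> <IDENTIFIER> <EQUALS> Expression() <SEMICOLON> <EOF>
}

void Expression() :
{}
{
    Operand() ( <DIV> Operand() )*
}

void Operand() :
{}
{
    <IDENTIFIER> | <REGEX_LITERAL>
}
```

With this layout the switch happens inside the token manager as each token is matched, so the parser productions never have to change state themselves. A real grammar would also need to decide, for every other token kind hidden behind the "..." in the question (literals, closing brackets, other operators), which of the two states it should leave the token manager in.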