How to specify a "greedy identifier with spaces" in ANTLR?
Suppose we have input that looks like a sequence of simple English statements, each on a separate line, like these:
Alice checks
Bob bets 100
Charlie raises 100
Alice folds
Let's try parsing it with this grammar:
actions: action* EOF;
action: player=name (check | bet | call | raise | fold) NEWLINE;
check: 'checks';
bet: 'bets' amount;
call: 'calls' amount;
raise: 'raises' amount;
fold: 'folds';
name: /* The subject of this question */;
amount: '$'? INT;
INT: ('0'..'9')+;
NEWLINE: '\r'? '\n';
The number of different verbs is fixed, but what's interesting is that the name we are trying to match can have spaces in it, and verbs can be part of it, too! So the following input is valid:
Guy who always bets 100 checks
Guy who always checks bets 100
Guy who always calls folds
Guy who always folds raises 100
Guy who always checks and then raises bets by others calls $100
So the question is: how do we define name so that it is greedy enough to consume spaces and words that we usually treat as verbs, but not so greedy that the verbs can no longer be matched by the action rule?
My first attempt at solving this looked like this:
name: WORD (S WORD)*;
WORD: ('a'..'z'|'A'..'Z'|'0'..'9')+; // Yes, 1234 is a WORD, too...
S: ' '; // We have to keep spaces in names
Unfortunately, this will not match 'Guy who always bets', since bets is not a WORD but a different token, defined by the literal in the bet rule. I wanted to get around that by creating a rule like keyword[String word] and making other rules match, say, keyword["bets"] instead of a literal, but that's where I got stuck. (I guess I could just list all my verbs as valid alternatives inside a name, but it just feels wrong.)
One more thing: all the names are declared before they are used, so I can read them before I start parsing actions. And they can't be longer than MAX_NAME_LENGTH characters. Can that be of any help here?
Maybe I'm going about this the wrong way anyway. ANTLR gurus, can I hear from you?
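To make the "names are declared before use" hint concrete, here is a minimal sketch in plain Java, outside ANTLR. It assumes the declared names are already available as a list and that an action is one of the verbs above with an optional amount. The class DeclaredNameSplitter, its split method, and the ACTION pattern are hypothetical names made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: splits a line into a declared
// player name and the action text that follows it.
class DeclaredNameSplitter {
    // Assumed action shape: a plain verb, or a verb followed by an amount
    // made of digits with an optional leading '$'.
    private static final Pattern ACTION =
            Pattern.compile("checks|folds|(bets|calls|raises) \\$?[0-9]+");

    private final List<String> declaredNames;

    DeclaredNameSplitter(List<String> declaredNames) {
        this.declaredNames = declaredNames;
    }

    /** Returns {name, action}, or null if no declared name fits the line. */
    String[] split(String line) {
        for (String name : declaredNames) {
            if (line.startsWith(name + " ")) {
                String rest = line.substring(name.length() + 1);
                // Accept the name only if what follows is a complete action.
                if (ACTION.matcher(rest).matches()) {
                    return new String[] { name, rest };
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        DeclaredNameSplitter splitter = new DeclaredNameSplitter(
                Arrays.asList("Guy", "Guy who always bets 100", "Alice"));
        String[] parts = splitter.split("Guy who always bets 100 checks");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

Checking the remainder against the action pattern matters: a declared name that is merely a prefix of the line (here "Guy") is rejected when the rest of the line is not a valid action.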
The easy way out would be to enable global backtracking on your entire grammar. This is normally not recommended, but I guess your grammar will stay relatively small, in which case it won't matter much for the run-time of your parser. If you do find it becomes slow, you could un-comment the memoize option, which will make your parser faster at the cost of some memory consumption.
A demo:
in.txt
Poker.g
... '$'? INT;
anyWord : BETS | FOLDS | CHECKS | CALLS | RAISES | INT | WORD;
BETS : 'bets';
FOLDS : 'folds';
CHECKS : 'checks';
CALLS : 'calls';
RAISES : 'raises';
WORD : ('a'..'z' | 'A'..'Z')+;
INT : '0'..'9'+;
SPACES : ' '+;
NEWLINE : '\r'? '\n';
Main.java
Running the Main class produces:
EDIT
You could do it the other way around: negate the tokens that you don't want anyWord to match. anyWord then matches any token except SPACES, NEWLINE and DOLLAR tokens. Note the difference between ~ inside lexer rules (negates characters) and parser rules (negates tokens!).

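As a rough illustration of the backtracking idea, the sketch below is hand-written Java, not what ANTLR generates. It gives the name as many words as possible and backs off one word at a time until the remaining words form a complete action. The class BacktrackingActionParser, the verb lists, and the amount pattern are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of "greedy name, then backtrack".
class BacktrackingActionParser {
    private static final List<String> PLAIN_VERBS = Arrays.asList("checks", "folds");
    private static final List<String> AMOUNT_VERBS = Arrays.asList("bets", "calls", "raises");

    /** Returns the player name for one line, or null if the line is not a valid action. */
    static String parseName(String line) {
        String[] words = line.trim().split(" +");
        // Start with the longest possible name and shrink it until the
        // remaining words parse as exactly one action.
        for (int nameLen = words.length - 1; nameLen >= 1; nameLen--) {
            if (isAction(words, nameLen)) {
                return String.join(" ", Arrays.copyOfRange(words, 0, nameLen));
            }
        }
        return null;
    }

    // True if words[from..end] is exactly one action.
    private static boolean isAction(String[] words, int from) {
        int remaining = words.length - from;
        if (remaining == 1) {
            return PLAIN_VERBS.contains(words[from]);
        }
        if (remaining == 2) {
            return AMOUNT_VERBS.contains(words[from])
                    && words[from + 1].matches("\\$?[0-9]+");
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(parseName(
                "Guy who always checks and then raises bets by others calls $100"));
    }
}
```

Because an action is at most two words long here, the loop gives up after very few attempts. A general backtracking parser has no such bound, which is why memoization can become worthwhile for larger grammars.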
Simple solution: split on whitespace, reverse the input word-by-word, then parse from the right instead of from the left. (This requires rewriting your grammar, of course.)
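A minimal sketch of that idea in plain Java, assuming the same verbs as in the question and an amount made of digits with an optional leading '$'. The class ReversedLineParser and its parse method are hypothetical names, and the sketch replaces the ANTLR grammar rather than rewriting it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: reverse the words so the action comes first.
class ReversedLineParser {
    /** Returns {name, verb, amount} (amount is null for checks/folds), or null if invalid. */
    static String[] parse(String line) {
        List<String> words = new ArrayList<>(Arrays.asList(line.trim().split("\\s+")));
        Collections.reverse(words); // the last word of the line is now first

        int i = 0;
        String amount = null;
        // An amount, if present, is the first word of the reversed line.
        if (words.get(i).matches("\\$?[0-9]+")) {
            amount = words.get(i++);
        }
        if (i >= words.size()) {
            return null; // nothing left to be the verb
        }
        String verb = words.get(i++);
        boolean takesAmount = verb.equals("bets") || verb.equals("calls") || verb.equals("raises");
        boolean plain = verb.equals("checks") || verb.equals("folds");
        boolean valid = (takesAmount && amount != null) || (plain && amount == null);
        if (!valid || i >= words.size()) {
            return null; // unknown verb, wrong amount usage, or empty name
        }
        // Everything that is left is the name; restore its original word order.
        List<String> nameWords = new ArrayList<>(words.subList(i, words.size()));
        Collections.reverse(nameWords);
        return new String[] { String.join(" ", nameWords), verb, amount };
    }

    public static void main(String[] args) {
        String[] r = parse("Guy who always folds raises 100");
        System.out.println(r[0] + " | " + r[1] + " | " + r[2]);
    }
}
```

Reading from the right works because the action sits at a fixed position at the end of each line, so the name is simply whatever remains once the action has been consumed.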