How do I specify a "greedy identifier with spaces" in ANTLR?

Posted on 2024-11-15 09:12:36

Suppose the input is a sequence of simple English statements, each on its own line, like these:

Alice checks
Bob bets 100
Charlie raises 100
Alice folds

Let's try parsing it with this grammar:

actions: action* EOF;
action: player=name (check | bet | call | raise | fold) NEWLINE;
check: 'checks';
bet: 'bets' amount;
call: 'calls' amount;
raise: 'raises' amount;
fold: 'folds';

name: /* The subject of this question */;
amount: '$'? INT;

INT: ('0'..'9')+;
NEWLINE: '\r'? '\n';

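The set of verbs is fixed, but the interesting part is that the name we are trying to match can contain spaces, and verbs can be part of it too! So the following input is valid: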

Guy who always bets 100 checks
Guy who always checks bets 100
Guy who always calls folds
Guy who always folds raises 100
Guy who always checks and then raises bets by others calls $100

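So the question is: how do we define name so that it is greedy enough to consume spaces and words that we would normally treat as verbs, but not so greedy that the verbs can no longer be matched by the action rule?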

My first attempt at solving this looked like this:

name: WORD (S WORD)*;
WORD: ('a'..'z'|'A'..'Z'|'0'..'9')+; // Yes, 1234 is a WORD, too...
S: ' '; // We have to keep spaces in names

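Unfortunately, this will not match 'Guy who always bets', since bets is not a WORD but a different token, defined by a literal in the bets rule. I wanted to get around that by creating a rule like keyword[String word] and having the other rules match, say, keyword["bets"] instead of a literal, but that's where I got stuck. (I guess I could just list all my verbs as valid alternatives inside a name, but that feels wrong.)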

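There's more: all names are declared before they are used, so I can read them before I start parsing actions. Also, they can't be longer than MAX_NAME_LENGTH characters. Could that be of any help here?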
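
To illustrate how pre-declared names could help, here is a minimal sketch in plain Java, outside ANTLR: it picks the longest declared name that prefixes a line, leaving the rest to be parsed as the action. The class name KnownNameSketch, the method matchName and the value 64 for MAX_NAME_LENGTH are assumptions made for this example; the question gives no concrete limit.

```java
import java.util.Set;

public class KnownNameSketch {
    // Hypothetical limit: the question names MAX_NAME_LENGTH but gives no value.
    static final int MAX_NAME_LENGTH = 64;

    /**
     * Returns the longest declared name that is a prefix of the line and is
     * followed by a space, or null if no declared name matches.
     */
    static String matchName(String line, Set<String> declaredNames) {
        // A name can never be longer than MAX_NAME_LENGTH, so start there.
        int limit = Math.min(line.length() - 1, MAX_NAME_LENGTH);
        for (int end = limit; end >= 1; end--) {
            if (line.charAt(end) == ' ' && declaredNames.contains(line.substring(0, end))) {
                return line.substring(0, end);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> names = Set.of(
                "Guy who always checks",
                "Guy who always checks and then raises bets by others");
        String line = "Guy who always checks and then raises bets by others calls $100";
        String name = matchName(line, names);
        // Whatever follows the name is what the action rule would have to match.
        System.out.println(name + " -> " + line.substring(name.length() + 1));
    }
}
```

Trying the longest candidate first matters here because one declared name can be a prefix of another, as in the two names above.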

Maybe I'm going about this the wrong way anyway. ANTLR gurus, can I hear from you?

Comments (2)

冷︶言冷语的世界 2024-11-22 09:12:36

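The easy way out would be to enable global backtracking on your entire grammar. This is normally not recommended, but I guess your grammar will stay relatively small, in which case it won't matter much for the run time of your parser. If you do find it becomes slow, you could uncomment the memoize option, which will make your parser faster at the cost of some memory consumption.

A demo: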

in.txt

Guy who always bets 100 checks
Guy who always checks bets 100
Guy who always calls folds
Guy who always folds raises 100
Guy who always checks and then raises bets by others calls $100

Poker.g

grammar Poker;

options {
  backtrack=true;
  // memoize=true;
}

actions
  :  action* EOF
  ;

action
  :  name SPACES (bets | calls | raises | CHECKS | FOLDS) SPACES? (NEWLINE | EOF)
     {
       System.out.println($name.text);
     }
  ;

bets    : BETS SPACES amount;
calls   : CALLS SPACES amount;
raises  : RAISES SPACES amount;
name    : anyWord (SPACES anyWord)*;
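amount  : '$'? INT;
anyWord : BETS | FOLDS | CHECKS | CALLS | RAISES | INT | WORD;

BETS    : 'bets';
FOLDS   : 'folds';
CHECKS  : 'checks';
CALLS   : 'calls';
RAISES  : 'raises';
WORD    : ('a'..'z' | 'A'..'Z')+;
INT     : '0'..'9'+;
SPACES  : ' '+;
NEWLINE : '\r'? '\n';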

Main.java

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    PokerLexer lexer = new PokerLexer(new ANTLRFileStream("in.txt"));
    PokerParser parser = new PokerParser(new CommonTokenStream(lexer));
    parser.actions();
  }
}

Running the Main class produces:

bart@hades:~/Programming/ANTLR/Demos/Poker$ java -cp antlr-3.3.jar org.antlr.Tool Poker.g 
bart@hades:~/Programming/ANTLR/Demos/Poker$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/Poker$ java -cp .:antlr-3.3.jar Main
Guy who always bets 100
Guy who always checks
Guy who always calls
Guy who always folds
Guy who always checks and then raises bets by others
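
To make the effect of backtracking more concrete, here is a rough sketch in plain Java of the same "greedy, but back off until the rest is a valid action" idea. It is not what ANTLR generates, only an illustration; the class GreedyNameSketch and its methods are names invented for this example, and it assumes words are separated by spaces only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class GreedyNameSketch {
    // Verbs that stand alone, and verbs that must be followed by an amount.
    private static final Set<String> BARE_VERBS = Set.of("checks", "folds");
    private static final Set<String> AMOUNT_VERBS = Set.of("bets", "calls", "raises");

    /** True if the words from index start to the end form one complete action. */
    static boolean isAction(List<String> words, int start) {
        int remaining = words.size() - start;
        if (remaining == 1) {
            return BARE_VERBS.contains(words.get(start));
        }
        if (remaining == 2) {
            return AMOUNT_VERBS.contains(words.get(start))
                    && words.get(start + 1).matches("\\$?[0-9]+");
        }
        return false;
    }

    /** Returns the player name, or null if the line is not a valid action. */
    static String extractName(String line) {
        List<String> words = Arrays.asList(line.trim().split(" +"));
        // Start with the greediest possible name and give back one word at a time.
        for (int nameLen = words.size() - 1; nameLen >= 1; nameLen--) {
            if (isAction(words, nameLen)) {
                return String.join(" ", words.subList(0, nameLen));
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extractName("Guy who always bets 100 checks"));
        System.out.println(extractName("Guy who always checks bets 100"));
    }
}
```

For the lines in in.txt this yields the same names as the ANTLR demo above, because in both cases the name takes as many words as it can while still leaving a complete action at the end of the line.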

EDIT

You could do it the other way around: negate the tokens that you don't want anyWord to match:

// other parser rules
anyWord : ~(SPACES | NEWLINE | DOLLAR); 

BETS    : 'bets';
FOLDS   : 'folds';
CHECKS  : 'checks';
CALLS   : 'calls';
RAISES  : 'raises';
WORD    : ('a'..'z' | 'A'..'Z')+;
INT     : '0'..'9'+;
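DOLLAR  : '$';
SPACES  : ' '+;
NEWLINE : '\r'? '\n';

anyWord now matches any token except SPACES, NEWLINE and DOLLAR. Note the difference between ~ inside lexer rules (negates characters) and parser rules (negates tokens!).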

愁以何悠 2024-11-22 09:12:36

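Simple solution: split on whitespace, reverse the input word by word, then parse from the right instead of from the left. (This requires rewriting your grammar, of course.)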
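
A minimal sketch of this idea in plain Java, without ANTLR, might look as follows. The class ReverseParseSketch and the method parseLine are invented for this example, and the verb sets are taken from the demo grammar in the other answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class ReverseParseSketch {
    private static final Set<String> BARE_VERBS = Set.of("checks", "folds");
    private static final Set<String> AMOUNT_VERBS = Set.of("bets", "calls", "raises");

    /** Returns {name, action} for one line, or null if it is not a valid action. */
    static String[] parseLine(String line) {
        List<String> reversed = new ArrayList<>(Arrays.asList(line.trim().split("\\s+")));
        Collections.reverse(reversed);

        int consumed; // number of words that belong to the action
        String action;
        if (reversed.size() >= 2 && BARE_VERBS.contains(reversed.get(0))) {
            consumed = 1;
            action = reversed.get(0);
        } else if (reversed.size() >= 3
                && reversed.get(0).matches("\\$?[0-9]+")
                && AMOUNT_VERBS.contains(reversed.get(1))) {
            consumed = 2;
            action = reversed.get(1) + " " + reversed.get(0);
        } else {
            return null;
        }

        // Everything left over is the name; restore its original word order.
        List<String> nameWords = new ArrayList<>(reversed.subList(consumed, reversed.size()));
        Collections.reverse(nameWords);
        return new String[] { String.join(" ", nameWords), action };
    }

    public static void main(String[] args) {
        String[] result = parseLine("Guy who always folds raises 100");
        System.out.println(result[0] + " | " + result[1]);
    }
}
```

Reading from the right removes the ambiguity because the action always sits at the end of the line and is at most two words long, so no lookahead over the name is needed.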
