Dropping required surrounding quotes in the lexer/parser

Posted on 2025-01-13 20:38:55

In several projects I have run into a similar issue in my grammars.

I need to parse something like Key="Value".

So I created a grammar (the simplest I could make to show the effect):

grammar test;

KEY   : [a-zA-Z0-9]+ ;
VALUE : DOUBLEQUOTE [ _a-zA-Z0-9.-]+ DOUBLEQUOTE ;

DOUBLEQUOTE     : '"'           ;
EQUALS          : '='           ;

entry    : key=KEY EQUALS value=VALUE;

I can now parse thing="One Two Three", and in my code I receive:

key = thing
value = "One Two Three"

In all of my projects I end up with an extra step to strip those " from the value.

Usually something like this (I use Java)

String value = ctx.value.getText();
value = value.substring(1, value.length()-1);
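
A slightly more defensive version of that manual step, as a minimal sketch: it only strips when the text really starts and ends with a double quote. The class name `QuoteStripper` and the method `stripQuotes` are hypothetical names chosen for illustration, and the sketch assumes double quotes are the only quoting style.

```java
final class QuoteStripper {
    // Hypothetical helper: removes one pair of surrounding double quotes, if present.
    public static String stripQuotes(String text) {
        if (text != null && text.length() >= 2
                && text.charAt(0) == '"'
                && text.charAt(text.length() - 1) == '"') {
            return text.substring(1, text.length() - 1);
        }
        return text; // not quoted (or too short): return unchanged
    }

    public static void main(String[] args) {
        System.out.println(stripQuotes("\"One Two Three\"")); // prints: One Two Three
        System.out.println(stripQuotes("plain"));             // prints: plain
    }
}
```

With such a helper the call site would become `String value = QuoteStripper.stripQuotes(ctx.value.getText());`, but the extra step outside the grammar is still there.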

In my real grammars I find it very hard to move the check of the surrounding " into the parser.

Is there a clean way to already drop the " by doing something in the lexer/parser?

Essentially I want ctx.value.getText() to return One Two Three instead of "One Two Three".

Update:

I have been playing with the excellent answer provided by Bart Kiers and found this variation, which does exactly what I was looking for.
By putting the DOUBLEQUOTE tokens on the hidden channel, they are still matched by the lexer (and still switch modes) but are hidden from the parser.

TestLexer.g4

lexer grammar TestLexer;

KEY         : [a-zA-Z0-9]+;
DOUBLEQUOTE : '"' -> channel(HIDDEN), pushMode(STRING_MODE);
EQUALS      : '=';

mode STRING_MODE;

  STRING_DOUBLEQUOTE
   : '"' -> channel(HIDDEN), type(DOUBLEQUOTE), popMode
   ;

  STRING
   : [ _a-zA-Z0-9.-]+
   ;

and

TestParser.g4

parser grammar TestParser;

options { tokenVocab=TestLexer; }

entry : key=KEY EQUALS value=STRING ;
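
To make the hidden-channel idea concrete without the ANTLR toolchain, here is a hand-written sketch in plain Java. It is not ANTLR-generated code and does not use the ANTLR runtime: the class `HiddenQuoteSketch`, its `Token` type, and the methods `tokenize` and `visibleTokens` are all hypothetical. A boolean stands in for the mode stack (one level only), and a `hidden` flag stands in for the HIDDEN channel. Unlike a real ANTLR lexer, it throws on unexpected characters instead of reporting an error and continuing.

```java
import java.util.ArrayList;
import java.util.List;

final class HiddenQuoteSketch {
    enum Type { KEY, EQUALS, DOUBLEQUOTE, STRING }

    // A token with its type, its text, and whether it is on the "hidden channel".
    static final class Token {
        final Type type;
        final String text;
        final boolean hidden;

        Token(Type type, String text, boolean hidden) {
            this.type = type;
            this.text = text;
            this.hidden = hidden;
        }
    }

    // Mirrors [a-zA-Z0-9]
    static boolean isKeyChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // Mirrors [ _a-zA-Z0-9.-]
    static boolean isStringChar(char c) {
        return isKeyChar(c) || c == ' ' || c == '_' || c == '.' || c == '-';
    }

    // Produces every token, including the hidden quote tokens.
    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        boolean inString = false; // stands in for pushMode/popMode of STRING_MODE
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '"') {
                // The quote is tokenized but marked hidden, and it toggles the mode.
                tokens.add(new Token(Type.DOUBLEQUOTE, "\"", true));
                inString = !inString;
                i++;
            } else if (!inString && c == '=') {
                tokens.add(new Token(Type.EQUALS, "=", false));
                i++;
            } else {
                int start = i;
                while (i < input.length()
                        && (inString ? isStringChar(input.charAt(i)) : isKeyChar(input.charAt(i)))) {
                    i++;
                }
                if (i == start) {
                    throw new IllegalArgumentException("Unexpected character at index " + i);
                }
                Type type = inString ? Type.STRING : Type.KEY;
                tokens.add(new Token(type, input.substring(start, i), false));
            }
        }
        return tokens;
    }

    // What a parser reading only the default channel would see.
    public static List<Token> visibleTokens(List<Token> all) {
        List<Token> visible = new ArrayList<>();
        for (Token t : all) {
            if (!t.hidden) {
                visible.add(t);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        for (Token t : visibleTokens(tokenize("thing=\"One Two Three\""))) {
            System.out.println(t.type + " " + t.text);
        }
    }
}
```

In this sketch the quotes still exist in the full token list, so they can still delimit the string, but the filtered list contains only KEY, EQUALS and STRING, and the STRING text carries no quotes.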


梓梦 2025-01-20 20:38:55

Try this:

VALUE
 : DOUBLEQUOTE [ _a-zA-Z0-9.-]+ DOUBLEQUOTE 
   {setText(getText().substring(1, getText().length()-1));}
 ;

Needless to say: this ties your grammar to Java, and (depending on how much embedded Java code you have) your grammar will be hard to port to another target language.

EDIT

Once a token is created, there is no built-in way to separate it (other than doing so in embedded actions, as I demonstrated). What you're looking for can be done, but that means rewriting your grammar so that a string literal is not constructed as a single token. This can be done by using lexical modes so that the string can be constructed in the parser.

A quick demo:

TestLexer.g4

lexer grammar TestLexer;

KEY         : [a-zA-Z0-9]+;
DOUBLEQUOTE : '"' -> pushMode(STRING_MODE);
EQUALS      : '=';

mode STRING_MODE;

  STRING_DOUBLEQUOTE
   : '"' -> type(DOUBLEQUOTE), popMode
   ;

  STRING_ATOM
   : [ _a-zA-Z0-9.-]
   ;

TestParser.g4

parser grammar TestParser;

options { tokenVocab=TestLexer; }

entry : key=KEY EQUALS value;

value : DOUBLEQUOTE string_atoms DOUBLEQUOTE;

string_atoms : STRING_ATOM*;

If you now run the Java code:

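import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;

// Lex and parse the sample input, then print only the text between the quotes.
Lexer lexer = new TestLexer(CharStreams.fromString("Key=\"One Two Three\""));
TestParser parser = new TestParser(new CommonTokenStream(lexer));

TestParser.EntryContext entry = parser.entry();
System.out.println(entry.value().string_atoms().getText());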

this will be printed: