ANTLR rule to consume a fixed number of characters

Posted 2024-09-29 14:35:26 · length 860 · 5 views · 0 comments

I am trying to write an ANTLR grammar for the PHP serialize() format, and everything seems to work fine except for strings. The problem is that the format of serialized strings is:

s:6:"length";

In terms of regexes, a rule like s:(\d+):".{\1}"; would describe this format if only backreferences were allowed in the "number of matches" count (but they are not).

But I cannot find a way to express this in either a lexer or a parser grammar: the number of characters to read has to depend on a count matched earlier in the same token, as in Fortran Hollerith constants (e.g. 6HLength), rather than on a string delimiter.
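For comparison, outside ANTLR the same dependency is easy to express in two steps: match the prefix with a regex, then slice by the captured count. The sketch below is plain Python, not ANTLR; `read_php_string` is a hypothetical helper name, and it assumes the length counts characters (PHP's serialize() actually records bytes, so this only holds for single-byte text).

```python
import re

# Matches only the fixed prefix; the payload length comes from group 1.
PREFIX = re.compile(r's:(\d+):"')

def read_php_string(data, pos=0):
    """Read one s:N:"...";  item starting at pos.

    Returns (payload, position just after the item).
    Assumption: N counts characters, which is only true for
    single-byte text, since PHP stores the byte length.
    """
    match = PREFIX.match(data, pos)
    if match is None:
        raise ValueError("expected s:<length>:\" at position %d" % pos)
    length = int(match.group(1))
    start = match.end()
    end = start + length
    # The payload is taken by count, so embedded quotes are harmless.
    if data[end:end + 2] != '";':
        raise ValueError("expected closing \"; at position %d" % end)
    return data[start:end], end + 2

if __name__ == "__main__":
    print(read_php_string('s:6:"length";'))
```

What I am after is the equivalent of that second step inside a single lexer rule.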

This example from the ANTLR grammar for Fortran seems to point the way, but I don't see how to apply it. Note that my target language is Python, while most of the documentation and examples are for Java:

// numeral literal
ICON {int counter=0;} :
    /* other alternatives */
    // hollerith
    'h' ({counter>0}? NOTNL {counter--;})* {counter==0}?
      {
      $setType(HOLLERITH);
      String str = $getText;
      str = str.replaceFirst("([0-9])+h", "");
      $setText(str);
      }
    /* more alternatives */
    ;

Comments (1)

且行且努力 2024-10-06 14:35:26

Since input like s:3:"a"b"; is valid, you can't define a String token in your lexer, unless the first and last double quote are always the start and end of your string. But I guess this is not the case.

So, you'll need a lexer rule like this:

SString
  :  's:' Int ':"' ( . )* '";'
  ;

In other words: match s:, then an integer value followed by :", then zero or more characters that can be anything, ending with ";. But you need to tell the lexer to stop consuming once it has read the number of characters given by Int. You can do that by mixing some plain code into your grammar, which you embed by wrapping it inside { and }. So first convert the text of the Int token into an integer variable called chars:

SString
  :  's:' Int {chars = int($Int.text)} ':"' ( . )* '";'
  ;

Now embed some code inside the ( . )* loop to stop it consuming as soon as chars is counted down to zero:

SString
  :  's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
  ;

and that's it.
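If it helps to see the control flow without ANTLR around it, here is a rough hand-written equivalent of what that rule does, as a sketch only. `lex_sstrings` is a made-up name, this is not the code ANTLR generates, and it assumes the count is in characters.

```python
def lex_sstrings(data):
    """Split data into s:N:"..."; tokens using a countdown, like the rule above."""
    tokens = []
    pos = 0
    while pos < len(data):
        start = pos
        if not data.startswith('s:', pos):
            raise ValueError("expected 's:' at position %d" % pos)
        pos += 2
        # Int: one or more ASCII digits
        digits_start = pos
        while pos < len(data) and data[pos] in '0123456789':
            pos += 1
        if pos == digits_start:
            raise ValueError("expected digits at position %d" % pos)
        chars = int(data[digits_start:pos])
        if not data.startswith(':"', pos):
            raise ValueError("expected ':\"' at position %d" % pos)
        pos += 2
        # ( {if chars == 0: break} . {chars = chars-1} )*
        while True:
            if chars == 0:
                break
            if pos >= len(data):
                raise ValueError("input ended inside a string")
            pos += 1          # the '.' : consume any one character
            chars -= 1
        if not data.startswith('";', pos):
            raise ValueError("expected '\";' at position %d" % pos)
        pos += 2
        tokens.append(data[start:pos])
    return tokens

if __name__ == "__main__":
    for token in lex_sstrings('s:6:"length";s:1:""";s:0:"";s:3:"end";'):
        print('parsed: [%s]' % token)
```

The point of the countdown is that a quote inside the payload never ends the token; only the count does, and the closing "; is checked afterwards.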

A little demo grammar:

grammar Test;

options {
  language=Python;
}

parse
  :  (SString {print 'parsed: [\%s]' \% $SString.text})+ EOF
  ;

SString
  :  's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
  ;

Int
  :  '0'..'9'+
  ;

(note that you need to escape the % inside your grammar!)

And a test script:

import antlr3
from TestLexer import TestLexer
from TestParser import TestParser

input = 's:6:"length";s:1:""";s:0:"";s:3:"end";'
char_stream = antlr3.ANTLRStringStream(input)
lexer = TestLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = TestParser(tokens)
parser.parse()

which produces the following output:

parsed: [s:6:"length";]
parsed: [s:1:""";]
parsed: [s:0:"";]
parsed: [s:3:"end";]