ANTLR rule to consume a fixed number of characters
I am trying to write an ANTLR grammar for the PHP serialize() format, and everything seems to work fine, except for strings. The problem is that the format of serialized strings is:
s:6:"length";
In terms of regexes, a rule like s:(\d+):".{\1}"; would describe this format if only backreferences were allowed in the "number of matches" count (but they are not).
But I cannot find a way to express this for either a lexer or parser grammar: the whole idea is to make the number of characters read depend on a backreference describing the number of characters to read, as in Fortran Hollerith constants (e.g. 6HLength), not on a string delimiter.
This example from the ANTLR grammar for Fortran seems to point the way, but I don't see how. Note that my target language is Python, while most of the docs and examples are for Java:
// numeral literal
ICON {int counter=0;} :
/* other alternatives */
// hollerith
'h' ({counter>0}? NOTNL {counter--;})* {counter==0}?
{
$setType(HOLLERITH);
String str = $getText;
str = str.replaceFirst("([0-9])+h", "");
$setText(str);
}
/* more alternatives */
;
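For comparison, outside ANTLR the "backreference in the count" idea can be emulated in plain Python with two passes: one regex reads the count, and a second regex is built from it. This is only a sketch of the idea, not ANTLR code; the names HEADER and match_counted_string are made up for illustration, and it treats the count as a number of characters.

```python
import re

# Header of a serialized string: s:<count>:"
HEADER = re.compile(r's:(\d+):"')

def match_counted_string(text, pos=0):
    """Return (payload, end_pos) for a string starting at pos, or None."""
    header = HEADER.match(text, pos)
    if header is None:
        return None
    count = int(header.group(1))
    # Second pass: the count is now a plain integer, so it can be put
    # into the {n} quantifier. DOTALL lets '.' match newlines too.
    body = re.compile(r'(.{%d})";' % count, re.DOTALL)
    m = body.match(text, header.end())
    if m is None:
        return None
    return m.group(1), m.end()

print(match_counted_string('s:6:"length";'))  # ('length', 13)
```

The returned end position makes it possible to continue matching right after the closing "; of the string.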
Comments (1)
Since input like
s:3:"a"b";
is valid, you can't define a String token in your lexer, unless the first and last double quotes are always the start and end of your string. But I guess this is not the case. So you'll need a lexer rule that reads the count and then consumes exactly that many characters.
The rule should match s:, then an Int value followed by :", then characters that can be anything, ending with ";. But you need to tell the lexer to stop consuming once the number of characters given by Int has been read. You can do that by mixing some plain code into your grammar: you embed plain code by wrapping it inside { and }. So first convert the value the Int token holds into an integer variable called chars. Then embed some code inside the ( . )* loop to stop it consuming as soon as chars is counted down to zero, and that's it.
A little demo grammar:
(Note that you need to escape the % inside your grammar!)
And a test script:
which produces the following output: