ANTLR parser question
I'm trying to parse a number of text records where elements in a record are separated by a '+' char, and where the entire record is terminated by a '#' char. For example E1+E2+E3+E4+E5+E6#
Individual elements can be required or optional. If an element is optional, its value is simply missing. For example, if E2 were missing, the input string would be: E1++E3+E4+E5+E6#.
When dealing with empty trailing elements, however, the separator char ('+') may be missing as well. If, for example, the last 3 elements were missing, the string could be: E1+E2+E3#, but it could also be:
E1+E2+E3+++#
I have tried the following rule in Antlr:
'R1' 'E1 + E2 + E3' '+'? 'E4'? '+'? 'E5'? '+'? 'E6'? '#'
but ANTLR complains that it's ambiguous, which of course is correct (every token following E3 could be E4, E5 or E6). The input syntax is fixed (it's from a legacy mainframe system), so I was wondering if anybody has a solution to this problem?
An alternative would be to specify all the different permutations in the rule, but that would be a major task.
Best regards and thanks,
Michael
Comments (2)
That task sounds like overkill for ANTLR. Is there any reason you're not just splitting the string into an array using the '+' as a separator?
If it's coming from a mainframe, it most likely was intended to be processed in a trivial way.
e.g.,
C++ : http://www.cplusplus.com/reference/clibrary/cstring/strtok/
PHP : https://www.php.net/manual/en/function.explode.php
Java: http://java.sun.com/javase/6/docs/api/java/lang/String.html#split%28java.lang.String%29
C# : http://msdn.microsoft.com/en-us/library/system.string.split%28VS.71%29.aspx
Just a thought.
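For what it's worth, here is a minimal sketch of the split approach in Python. The function name `split_record`, the `total` and `required` parameters, and the assumption that element values never contain '+' or '#' are all mine, not from the question. It strips the '#', splits on '+', and pads the trailing elements that the sender left out.

```python
def split_record(record, total=6, required=()):
    """Split one '+'-separated, '#'-terminated record into `total` fields.

    Hypothetical helper: `total` is the number of elements in the record
    layout and `required` holds the 0-based positions that must be present.
    Assumes element values never contain '+' or '#'.
    """
    if not record.endswith("#"):
        raise ValueError("record must end with '#'")
    fields = record[:-1].split("+")
    if len(fields) > total:
        raise ValueError(f"expected at most {total} elements, got {len(fields)}")
    # Trailing elements may be dropped together with their separators,
    # so pad the list back up to the full layout.
    fields += [""] * (total - len(fields))
    # An empty string between two separators means "element missing".
    values = [f if f else None for f in fields]
    missing = [i for i in required if values[i] is None]
    if missing:
        raise ValueError(f"required element(s) missing at positions {missing}")
    return values
```

With this, `split_record("E1+E2+E3#")` and `split_record("E1+E2+E3+++#")` produce the same six-slot list, which is the point: both spellings of "last three elements missing" collapse to one representation.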
If this is ambiguous, it's likely because your Es all have the same format (a more complicated case would be that your Es all just start with the same k characters, where k is your lookahead, but I'm going to assume that's not the case. If it is, this will still work; it will just require an extra step.)
So it looks like you can have up to 6 Es and up to 5 '+'s. We'll say a "segment" is an optional E followed by a '+' - you can have 5 segments, and an optional trailing E.
This grammar can be represented roughly like this (imperfect ANTLR syntax since I'm not very familiar with it):
If ANTLR doesn't support anything like {1,5}, then this is the same as:
which is not that clean, so maybe there is a nicer way to do it.
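To make the segment idea concrete, here is a rough sketch in Python rather than ANTLR syntax. It is not the grammar from this answer, just the same shape: up to 5 segments (an optional E followed by '+'), then an optional trailing E and the '#'. The names `SEGMENT`, `TAIL` and `parse_record` are made up for illustration, and I'm assuming an element is any run of characters other than '+' and '#'. It also accepts zero segments, which is looser than {1,5}.

```python
import re

# Assumption: an element is any run of characters other than '+' and '#'.
SEGMENT = re.compile(r"([^+#]*)\+")  # optional E followed by '+'
TAIL = re.compile(r"([^+#]*)#")      # optional trailing E, then '#'


def parse_record(text, max_segments=5):
    """Parse (E? '+'){0,max_segments} E? '#' and return max_segments + 1 slots."""
    pos = 0
    elements = []
    # Consume segments greedily, but never more than the layout allows.
    while len(elements) < max_segments:
        m = SEGMENT.match(text, pos)
        if m is None:
            break
        elements.append(m.group(1) or None)
        pos = m.end()
    # Whatever is left must be the optional last element plus the terminator.
    m = TAIL.match(text, pos)
    if m is None or m.end() != len(text):
        raise ValueError(f"malformed record at offset {pos}")
    elements.append(m.group(1) or None)
    # Elements dropped together with their separators become None.
    elements += [None] * (max_segments + 1 - len(elements))
    return elements
```

Because each slot is filled by position (count the '+'s seen so far), there is never a question of whether a token after E3 is E4, E5 or E6, which is what the original rule with its independent '+'? and 'E4'? options could not decide.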