如何阅读 EBNF 语法中的替代项

发布于 2024-09-06 08:14:15 字数 732 浏览 8 评论 0原文

我有一个 EBNF 语法，它有一些这种模式的规则：

sequence ::=
    item
    | item extra* sequence

上面的内容是否等同于下面的内容？

sequence ::=
    item (extra* sequence)*

编辑

由于你们中的一些人观察到两个序列中的错误或歧义，我将给出一个具体的例子。 SVG 规范提供了路径数据语法。这个语法有几个具有这种模式的生成器：

lineto-argument-sequence:
    coordinate-pair
    | coordinate-pair comma-wsp? lineto-argument-sequence

上面的内容可以重写如下吗？

lineto-argument-sequence:
    coordinate-pair (comma-wsp? lineto-argument-sequence)*

原文

I have an EBNF grammar that has a few rules with this pattern:

sequence ::=
    item
    | item extra* sequence

Is the above equivalent to the following?

sequence ::=
    item (extra* sequence)*

Edit

Due to some of you observing bugs or ambiguities in both sequences, I'll give a specific example. The SVG specification provides a grammar for path data. This grammar has several producers with this pattern:

lineto-argument-sequence:
    coordinate-pair
    | coordinate-pair comma-wsp? lineto-argument-sequence

Could the above be rewritten as the following?

lineto-argument-sequence:
    coordinate-pair (comma-wsp? lineto-argument-sequence)*

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

只涨不跌 2024-09-13 08:14:15

事实并非如此，他们似乎有不同的错误。第一个序列围绕“item”是不明确的，因为“extra”是可选的。您可以将其重写为以下内容以消除歧义：

sequence3 ::= 
    item extra* sequence3

第二个在“extra”周围不明确，因为它基本上是两个都以“extra”开头的嵌套循环。您可以将其重写为以下内容以消除歧义：

sequence4 ::=
    item ((extra|item))*

您的第一个版本可能会因由单个“项目”组成的输入序列（这取决于解析器实现）而阻塞，因为它不会消除歧义。

我的重写假设您想要匹配以“item”开头的序列，并且可以选择后跟一系列（0个或更多）“item”或“extra”（以任意顺序）。

例如，

item
item extra 
item extra item
item extra extra item
item item item item 
item item item item extra

etc.

如果没有其他信息，我个人会倾向于我标记为“sequence4”的选项，因为所有其他选项仅使用递归作为昂贵的循环构造。如果您愿意给我更多信息，我也许可以给出更好的答案。

编辑：基于乔恩的出色观察（带有一个小模型）。

如果您重写“sequence3”以删除递归，您会得到以下结果：

sequence5 ::= 
    (item extra*)+

它认为这将是我更喜欢的版本，而不是“sequence4”。

我必须指出，上述所有三个版本在功能上都是等效的（作为识别器或生成器）。 3 的解析树与 4 和 5 不同，但我认为这不会影响性能以外的任何事情。

编辑：
关于以下内容：

lineto-argument-sequence:
    coordinate-pair
    | coordinate-pair comma-wsp? lineto-argument-sequence

此产生式表示，lineto-argument-sequence 由至少一个 coordinate-pair 后跟零个或多个 coordinate-pair 组成 由可选的白色/逗号分隔。以下任何内容都将构成一个lineto-argument-sequence（读作“成为”）：

1,2        -> (1, 2)
1.5.6      -> (1.5, 0.6)
1.5.06     -> (1.5, 0.06)
2 3 3 4    -> (2,3) (3,4)
2,3-3-4    -> (2,3) (-3,-4)
2 3 3      -> ERROR

因此坐标对实际上是任何2个连续的数量。

我在 ANTLR 中模拟了一个似乎有效的语法。请注意，lineto_argument_sequence 使用的模式与 Jorn 和我之前推荐的模式类似。

grammar SVG;

lineto_argument_sequence
    : coordinate_pair (COMMA_WSP? coordinate_pair)*
    ;

coordinate_pair
    : coordinate COMMA_WSP? coordinate
    ;

coordinate
    : NUMBER
    ;

COMMA_WSP
    : ( WS+|WS*','WS*) //{ $channel=HIDDEN; }
    ;

NUMBER
    : '-'? (INT | FLOAT) ;

fragment
INT
    : '0'..'9'+ ;

fragment
FLOAT
    : ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
    | '.' ('0'..'9')+ EXPONENT?
    | ('0'..'9')+ EXPONENT
    ;

fragment
WS  : ' '  | '\t' | '\r' | '\n'  ;

fragment
EXPONENT
    : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;

给定以下输入：

2, 3 -3 -4 5.5.65.5.6

它生成这个解析树。

替代文本 http://www.freeimagehosting.net/uploads/85fc77bc3c.png

Not really, they seem to have different bugs. The first sequence is ambiguous around "item" seeing that "extra" is optional. You could rewrite it as the following to remove ambiguity:

sequence3 ::= 
    item extra* sequence3

The second one is ambigous around "extra", seeing as it is basically two nested loops both starting with "extra". You could rewrite it as the following to remove ambiguity:

sequence4 ::=
    item ((extra|item))*

Your first version will likely choke on an input sequence consisting of a single "item" (it depends on the parser implementation) because it won't disambiguate.

My rewrites assume you want to match a sequence starting with "item" and optionally followed by a series of (0 or more) "item" or "extra" in any order.

e.g.

item
item extra 
item extra item
item extra extra item
item item item item 
item item item item extra

etc.

Without additional information I would be personally inclined towards the option I labled "sequence4" as all the other options are merely using recursion as an expensive loop construct. If you are willing to give me more information I may be able to give a better answer.

EDIT: based on Jorn's excellent observation (with a small mod).

If you rewrite "sequence3" to remove recursion you get the following:

sequence5 ::= 
    (item extra*)+

It think this will be my prefered version, not "sequence4".

I have to point out that all three versions above are functionally equivalent (as recognizers or generators). The parse trees for 3 would be different to 4 and 5, but I cannot think that that would affect anything other than perhaps performance.

EDIT:
Concerning the following:

lineto-argument-sequence:
    coordinate-pair
    | coordinate-pair comma-wsp? lineto-argument-sequence

What this production says is that a lineto-argument-sequence is composed of at least one coordinate-pair followed by zero or more coordinate-pairs seperated by optional white/comma. Any of the following would constitute a lineto-argument-sequence (read -> as 'becomes'):

1,2        -> (1, 2)
1.5.6      -> (1.5, 0.6)
1.5.06     -> (1.5, 0.06)
2 3 3 4    -> (2,3) (3,4)
2,3-3-4    -> (2,3) (-3,-4)
2 3 3      -> ERROR

So a coordinate-pair is really any 2 consecutive numbers.

I have mocked up a grammar in ANTLR that seems to work. Note the pattern used for lineto_argument_sequence is similar to the one Jorn and I recommended previously.

grammar SVG;

lineto_argument_sequence
    : coordinate_pair (COMMA_WSP? coordinate_pair)*
    ;

coordinate_pair
    : coordinate COMMA_WSP? coordinate
    ;

coordinate
    : NUMBER
    ;

COMMA_WSP
    : ( WS+|WS*','WS*) //{ $channel=HIDDEN; }
    ;

NUMBER
    : '-'? (INT | FLOAT) ;

fragment
INT
    : '0'..'9'+ ;

fragment
FLOAT
    : ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
    | '.' ('0'..'9')+ EXPONENT?
    | ('0'..'9')+ EXPONENT
    ;

fragment
WS  : ' '  | '\t' | '\r' | '\n'  ;

fragment
EXPONENT
    : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;

Given the following input: