如何处理 EBNF 语法中不同标记中的重叠字符组?
我正在使用 LL(k) EBNF 语法来解析字符流。我需要三种不同类型的令牌:
CHARACTERS
letter = 'A'..'Z' + 'a'..'z' .
digit = "0123456789" .
messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')' .
TOKENS
num = ['-'] digit { digit } [ '.' digit { digit } ] .
ident = letter { letter | digit | '_' } .
message = messageChar { messageChar } .
前两个令牌声明很好,因为它们不共享任何公共字符。
但是,第三个 message
无效,因为某些字符串可能既是 num
又是 message
(例如 "123"
),其他字符串可以是 ident
和 message
(例如 "Hello"
)。因此,分词器无法正确区分。
另一个例子是区分整数和实数。除非您要求所有实数至少有一位小数(意味着 1 需要编码为 1.0,这对我来说不是一个选项),否则我无法在语法中获得这两个数字之间差异的支持类型。我必须将所有值都表示为实数,并在该点之后进行检查。这很好,但不是最优的。我真正的问题是 message
令牌。我找不到解决方法。
所以问题是,我可以用 LL(k) EBNF 语法来做到这一点吗?我使用 CoCo/R 生成解析器和扫描器。
如果我无法使用 LL(k) EBNF 做到这一点,那么我还可以考虑哪些其他选项?
编辑这是我从 CoCo/R 得到的输出:
Coco/R (Apr 23, 2010) Tokens double and message cannot be distinguished Tokens ident and message cannot be distinguished ... 9 errors detected
I'm using an LL(k) EBNF grammar to parse a character stream. I need three different types of tokens:
CHARACTERS
letter = 'A'..'Z' + 'a'..'z' .
digit = "0123456789" .
messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')' .
TOKENS
num = ['-'] digit { digit } [ '.' digit { digit } ] .
ident = letter { letter | digit | '_' } .
message = messageChar { messageChar } .
The first two token declarations are fine, because they don't share any common characters.
However the third, message
, is invalid because it's possible that some strings could be both num
and message
(such as "123"
), and other strings could be both an ident
and a message
(such as "Hello"
). Hence, the tokenizer can't differentiate correctly.
Another example is differentiating between integers and real numbers. Unless you require all real numbers to have at least one decimal place (meaning 1 would need to be encoded as 1.0, which isn't an option for me) then I can't get support in the grammar for the differences between these two numeric types. I've had to go for all values being expressed as reals and doing the checking after the point. That's fine, but sub-optimal. My real problem is with the message
token. I can't find a workaround for that.
So the question is, can I do this with an LL(k) EBNF grammar? I'm using CoCo/R to generate the parser and scanner.
If I can't do it with LL(k) EBNF, then what other options might I look into?
EDIT This is the output I get from CoCo/R:
Coco/R (Apr 23, 2010) Tokens double and message cannot be distinguished Tokens ident and message cannot be distinguished ... 9 errors detected
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
试试这个:
哦,我必须指出
'\u0020'
是 unicode SPACE,您随后将用“- ' '
”将其删除。哦,如果您不需要多个字符前瞻,您可以使用CONTEXT (')')
。这在您的情况下不起作用,因为上面的所有标记都可以出现在')'
之前。FWIW:
CONTEXT
不会使用所包含的序列,您仍然必须在生产中使用它。编辑:
好的,这似乎有效。真的,这次我是认真的:)
ANY
将读取字符,直到它遇到“)”或空格。我将它放在一个连接每个值的循环中,但您可能不想这样做。您可能希望将其放入循环中,这样当它看到“over here”时,它不会返回“over”,而是返回“here”。您可以对 messageText 进行简单的长度检查,或进行其他有效性检查,例如将 t.val 添加到列表并检查计数。真的什么都可以。您还可以使用正则表达式进行测试,以确保它符合您需要检查的任何模式。
编辑(2011 年 4 月 8 日):
使用 Coco/R 与整数和实数的示例
编辑(2011 年 4 月 9 日)
或
Try this:
Oh, I have to point out that
'\u0020'
is the unicode SPACE, which you are subsequently removing with "- ' '
". Oh, and you can useCONTEXT (')')
if you don't need more than one character lookahead. This does not work in your case seeing as all the tokens above can appear before a')'
.FWIW:
CONTEXT
does not consume the enclosed sequence, you must still consume it in your production.EDIT:
Ok, this seems to work. Really, I mean it this time :)
ANY
will read character until it hits ')' or whitespace. I put it in a loop concatenating each value, but you may not want to do that. You may want to have it in a loop though so that it doesn't return "over" when it sees "over here", but "here".You can do a simple length check on messageText, or other validity checks such as adding t.val to a List and checking the count. Anything really. You can also do a test with a RegEx to make sure it complies with whatever pattern you need to check against.
EDIT (8 Apr 2011):
Example using Coco/R with integers and reals
EDIT (9 Apr 2011)
or
您可能需要研究具有上下文敏感标记化功能的 PEG 生成器。
http://en.wikipedia.org/wiki/Parsing_expression_grammar
我想不出你的方法将使用 COCO/R 或类似的方法来解决这个问题,因为每个令牌都需要明确。
如果消息被引号或其他消除歧义的方式包围,那么您就不会有问题。我真的认为 PEG 可能是您的答案,因为它也有有序选择(第一个匹配)。
另请查看:
http://tinlizzie.org/ometa/
You may want to look into a PEG generator which has context sensitive tokenization.
http://en.wikipedia.org/wiki/Parsing_expression_grammar
I cannot think of a way you will get around this using COCO/R or similar, as each token needs to be unambiguous.
If messages were surrounded by quotes, or some other way of disambiguating then you would not have a problem. I really think PEG may be your answer, as it also has ordered choice (first match).
Also take a look at:
http://tinlizzie.org/ometa/
尽管有标题,但这一切似乎都与扫描器有关,而不是解析器。我没有使用过CoCo/R,所以我不能直接评论它,但在典型的(例如,lex/Flex)扫描仪中,规则是按顺序考虑的,所以选择的规则/模式是第一个匹配。我写过的大多数扫描仪都包含一个“.”。 (即匹配任何东西)作为最后一个模式,如果有一些输入与任何其他规则都不匹配,则显示错误消息。
Despite the title, this all seems to relate to the scanner, not the parser. I haven't used CoCo/R, so I can't comment on it directly, but in a typical (e.g., lex/Flex) scanner, rules are considered in order, so the rule/pattern that's chosen is the first one that matches. Most scanners I've written include a '.' (i.e., match anything) as their last pattern, to display an error message if there's some input that doesn't match any other rule.