Parsing C/C++ source: how are token boundaries/interactions specified in lex/yacc?
I want to parse some C++ code, and as a guide I've been looking at the C lex/yacc definitions here: http://www.lysator.liu.se/c/ANSI-C-grammar-l.html and http://www.lysator.liu.se/c/ANSI-C-grammar-y.html
I understand the specifications of the tokens themselves, but not how they interact. E.g. it's OK to have an operator such as = directly follow an identifier without intervening white space (i.e. "foo="), but it's not OK to have a numerical constant immediately followed by an identifier (i.e. 123foo). However, I don't see any way that such rules are represented.
What am I missing? Or is this lex/yacc simply too liberal in its acceptance of errors?
Comments (4)
The lexer converts a character stream into a token stream (I think that's what you mean by token specification). The grammar specifies what sequences of tokens are acceptable. Hence, you won't see that something is not allowed; you only see what is allowed. Does that make sense?
EDIT
If the point is to get the lexer to distinguish the sequence "123foo" from the sequence "123 foo", one way is to add a specification for "123foo". Another way is to treat spaces as significant.
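A sketch of what such a catch-all rule might look like in a flex-style specification (the `D`/`L` definitions follow the linked ANSI C lexer; the error action and token names here are illustrative, not from that file):

```lex
D   [0-9]
L   [a-zA-Z_]

%%
{D}+{L}({L}|{D})*   { fprintf(stderr, "invalid token: %s\n", yytext); /* catches 123foo */ }
{D}+                { return CONSTANT; }
{L}({L}|{D})*       { return IDENTIFIER; }
```

Because lex prefers the longest match, "123foo" matches the first rule as a single six-character error token rather than being split into CONSTANT and IDENTIFIER.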
EDIT2
A syntax error can be "detected" from the lexer or the grammar production or the later stages of the compiler (think of, say, type errors, which are still "syntax errors"). Which part of the whole compilation process detects which error is largely a design issue (as it affects the quality of error messages), I think. In the given example, it probably makes more sense to outlaw "123foo" via a tokenization to an invalid token rather than relying on the non-existence of a production with a numeric literal followed by an identifier (at least, this is the behaviour of gcc).
The lexer is fine with 123foo and will split that into two tokens.
But try and find the part in the syntax that allows those two tokens to sit side by side like that. Thus I bet it is the parser that generates an error when it sees these two tokens.
Note the lexer does not care about whitespace (unless you explicitly tell it to worry). In this case it just throws white space away:
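The code block is missing from this copy; the whitespace rule in the linked ANSI C lexer is essentially the following (its `count()` action just tracks the current line/column and discards the text):

```lex
[ \t\v\n\f]     { count(); }
```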
Just to check this is what I built:
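The lexer source is missing from this copy; a minimal reconstruction in the same spirit (token classes follow the linked grammar, but the print actions are my own) would be an l.l along these lines:

```lex
D   [0-9]
L   [a-zA-Z_]

%%
{L}({L}|{D})*   { printf("IDENTIFIER(%s)\n", yytext); }
{D}+            { printf("CONSTANT(%s)\n", yytext); }
[ \t\v\n\f]     { /* throw whitespace away */ }
.               { printf("OTHER(%s)\n", yytext); }
%%

int yywrap(void) { return 1; }
```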
Edited the file l.l to stop the compiler complaining about undeclared functions:
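The edit itself is lost in this copy; typically it amounts to adding a few declarations to the definitions section at the top of l.l, something like this (a hypothetical reconstruction):

```lex
%{
#include <stdio.h>
int yywrap(void);
%}
```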
Create the following file: main.c:
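The file is missing from this copy; a typical driver just calls the generated `yylex()` on stdin, so a plausible main.c is:

```c
/* main.c -- drives the lexer generated from l.l;
 * yylex() scans stdin to EOF, with the rule actions printing each token. */
extern int yylex(void);

int main(void)
{
    yylex();
    return 0;
}
```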
Build it:
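The commands are also missing; assuming the lexer specification is in l.l and the driver in main.c, the build would be along these lines (flex and gcc work equally well as lex and cc):

```
lex l.l
cc lex.yy.c main.c -o lexer
echo "123foo" | ./lexer
```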
Yes, it splits it into two tokens.
What's essentially going on is that the lexical rules for each token type are greedy. For instance, the character sequence
foo=
cannot be interpreted as a single identifier, because identifiers don't contain symbols. On the other hand, 123abc
is actually a numerical constant, though malformed, because numerical constants can end with a sequence of alphabetic characters that are used to express the type of the numerical constant.
You won't be able to parse C++ with plain lex and yacc, as it's an ambiguous grammar. You'd need a more powerful approach such as GLR, or some hackish solution which modifies the lexer at runtime (that's what most current C++ parsers are doing).
Take a look at Elsa/Elkhound.