Parsing C/C++ source: how are token boundaries/interactions specified in lex/yacc?

Posted 2024-10-03 18:27:12

I want to parse some C++ code, and as a guide I've been looking at the C lex/yacc definitions here: http://www.lysator.liu.se/c/ANSI-C-grammar-l.html and http://www.lysator.liu.se/c/ANSI-C-grammar-y.html

I understand the specifications of the tokens themselves, but not how they interact. E.g. it's OK to have an operator such as = directly follow an identifier without intervening white space (i.e. "foo="), but it's not OK to have a numerical constant immediately followed by an identifier (i.e. "123foo"). However, I don't see any way in which such rules are represented.

What am I missing? Or is this lex/yacc simply too liberal in what it accepts?

Comments (4)

一个人的夜不怕黑 2024-10-10 18:27:12

The lexer converts a character stream into a token stream (I think that's what you mean by token specification). The grammar specifies what sequences of tokens are acceptable. Hence, you won't see that something is not allowed; you only see what is allowed. Does that make sense?
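For instance, here is a hypothetical fragment in yacc syntax (not copied from the lysator grammar) that makes the asymmetry concrete: "foo = bar" is accepted because a production explicitly derives it, while "123 foo" is rejected simply because no production anywhere places a CONSTANT directly before an IDENTIFIER.

/* Hypothetical yacc fragment: assignment is derivable because a
   production spells it out; nothing derives CONSTANT IDENTIFIER,
   so those two tokens can never legally sit side by side. */
assignment_expression
    : IDENTIFIER '=' assignment_expression
    | IDENTIFIER
    | CONSTANT
    ;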

EDIT

If the point is to get the lexer to distinguish the sequence "123foo" from the sequence "123 foo", one way is to add a specification for "123foo". Another way is to treat white space as significant.
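A minimal sketch of that first option, reusing the {D} (digit) and {L} (letter) definitions the lysator lex file already declares: a catch-all rule that flags digits glued to letters inside the lexer itself. It should be placed after the constant rules so that valid suffixed literals of equal match length, such as 123L, still win ties.

{D}+{L}+    { fprintf(stderr, "invalid token: %s\n", yytext); /* report and discard, or return a dedicated error token */ }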

EDIT2

A syntax error can be detected by the lexer, by the grammar productions, or by later stages of the compiler (think of, say, type errors, which are still "syntax errors" in a broad sense). Which part of the whole compilation process detects which error is largely a design decision, I think, since it affects the quality of error messages. In the given example, it probably makes more sense to outlaw "123foo" by tokenizing it as an invalid token rather than relying on the absence of a production in which a numeric literal is followed by an identifier (at least, this is the behaviour of gcc).

塔塔猫 2024-10-10 18:27:12

The lexer is fine with 123foo and will split that into two tokens.

  • An integer constant
  • and an identifier.

But try to find the part of the grammar that allows those two tokens to sit side by side like that. Thus I bet it is the parser that generates an error when it sees these two tokens.

Note that the lexer does not care about white space (unless you explicitly tell it to worry about it). In this case it just throws white space away:

[ \t\v\n\f]     { count(); } // Throw away white space without looking.

Just to check, this is what I built:

wget -O l.l http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
wget -O y.y http://www.lysator.liu.se/c/ANSI-C-grammar-y.html

Edit the file l.l to stop the compiler complaining about undeclared functions:

#include "y.tab.h"

// Add the following lines
int  yywrap();
void count();
void comment();
void count();
int  check_type();
// Done adding lines

%}

Create the following file, main.c:

#include <stdio.h>

extern int yylex();

int main()
{
    int x;
    // Read tokens until the lexer reports end of input (0),
    // printing the numeric token id for each one.
    while((x = yylex()) != 0)
    {
        fprintf(stdout, "Token(%d)\n", x);
    }
}

Build it:

$ bison -d y.y
y.y: conflicts: 1 shift/reduce
$ flex l.l
$ gcc main.c lex.yy.c
$ ./a.out
123foo
123Token(259)
fooToken(258)

Yes, it splits the input into two tokens.
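To watch the grammar side reject the pair, one can link a driver against the full parser instead of calling yylex() directly. A sketch (the lysator y.y already supplies a yyerror() in its epilogue, so only main() is needed; feed it something like "int x = 123foo;"):

#include <stdio.h>

extern int yyparse();

int main()
{
    /* yyparse() returns 0 if the token stream derives from the start
       symbol and non-zero on a syntax error, so input containing
       "123foo" should be rejected even though the lexer split it fine. */
    int rc = yyparse();
    fprintf(stdout, rc == 0 ? "accepted\n" : "rejected\n");
    return rc;
}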

萌能量女王 2024-10-10 18:27:12

What's essentially going on is that the lexical rules for each token type are greedy. For instance, the character sequence foo= cannot be interpreted as a single identifier, because identifiers don't contain symbols. On the other hand, 123abc is scanned as a numerical constant, though a malformed one, because numerical constants can end with a sequence of alphabetic characters that express the constant's type (e.g. the suffixes in 123u or 123L).
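That greediness ("maximal munch") is easy to observe in ordinary C: the lexer always takes the longest token it can at each step, which is why a+++b means (a++) + b and never a + (++b). A small compilable demo:

#include <stdio.h>

int main()
{
    int a = 1, b = 2;
    /* Maximal munch: "a+++b" is tokenized as 'a' '++' '+' 'b',
       i.e. (a++) + b, so r == 3 and a is incremented to 2. */
    int r = a+++b;
    printf("r=%d a=%d b=%d\n", r, a, b);  /* prints r=3 a=2 b=2 */
    return 0;
}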

一张白纸 2024-10-10 18:27:12

You won't be able to parse C++ with plain lex and yacc, as its grammar is ambiguous. You'd need a more powerful approach such as GLR, or some hackish solution that modifies the lexer at runtime (which is what most current C++ parsers do).

Take a look at Elsa/Elkhound.
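The "modifies the lexer at runtime" trick is the classic lexer hack, and it is what the check_type() stub in the lysator lex file gestures at: the lexer must consult the symbol table to decide whether an identifier currently names a typedef. The ambiguity shows up even in plain C; in the sketch below the same character sequence "a * b;" parses two different ways:

/* Sketch of the typedef ambiguity behind the "lexer hack": whether
   "a * b;" is a declaration or an expression depends on what 'a'
   names at that point in the program. */
typedef int a;

void as_declaration(void)
{
    a * b;          /* 'a' is a type here: declares b as pointer to int */
    (void)b;
}

void as_expression(void)
{
    int a = 2, b = 3;
    (void)(a * b);  /* 'a' is a local variable here: a multiplication */
}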
