How much time would it take to write a C++ compiler using flex/yacc?

Posted 2024-08-16 01:17:36

How much time would it take to write a C++ compiler using lex/yacc?

Where can I get started with it?

Comments (13)

野稚 2024-08-23 01:17:37

Lex and yacc will not be enough. You also need an assembler, a linker, and a C preprocessor.
It depends on how you do it.
How many pre-made components do you plan to use?
You need to get a description of the syntax and its tokens from somewhere.

For example, if you use LLVM, you can proceed faster. It already provides a lot of tools: an assembler, a linker, an optimiser, and so on.
You can get a C preprocessor from the Boost project.
You need to create a test suite to test your compiler automatically.

It can take a year if you work on it every day, or much less if you have more talent and motivation.

青衫负雪 2024-08-23 01:17:37

Unless you have already written several other compilers, C++ is not a language you want to write a compiler for from scratch. The language has a lot of places where the meaning requires a lot of context before the situation can be disambiguated.
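To make the "requires a lot of context" point concrete, here is a minimal sketch (not taken from any real front end) of one classic case: a statement shaped like `X * Y;` is a pointer declaration if `X` names a type and a multiplication otherwise. The names `StatementKind`, `classify_star_statement` and `type_names` are hypothetical and exist only for illustration; `type_names` stands in for the symbol table a real compiler would consult.

```cpp
#include <set>
#include <string>

enum class StatementKind { Declaration, Expression };

// Hypothetical helper: classifies a statement of the shape `X * Y;`.
// `type_names` is an assumption standing in for a real symbol table.
StatementKind classify_star_statement(const std::string& first_identifier,
                                      const std::set<std::string>& type_names) {
    // If X names a type, `X * Y;` declares Y as a pointer to X.
    // Otherwise it is a multiplication whose result is discarded.
    return type_names.count(first_identifier) ? StatementKind::Declaration
                                              : StatementKind::Expression;
}
```

The grammar alone cannot make this choice: the same token sequence gets a different parse depending on declarations seen earlier, which is why a pure context-free tool needs extra feedback from semantic analysis.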

Even if you have lots of experience writing compilers you are looking at several years for a team of developers. This is just to parse the code correctly into an intermediate format. Writing the backend to generate code is yet another specialized task (though you could steal the gcc backend).

If you search Google for "C++ grammars", there are a couple around to get you started.

C++ LEX  Tokens:   http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxLexer.l
C++ YACC Grammar:  http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxGrammar.y
                   http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxTester.y

晨光如昨 2024-08-23 01:17:37

A few years - if you can get a research grant to rewrite a new lex/yacc :-)

People keep chasing their tails on this a lot - starting with Stroustrup, who always fancied being a language "designer" rather than an actual compiler writer (remember that his C++ was a mere codegen for ages and would still be there if it wasn't for gcc and other folks).

The core issue is that real research on parser generators pretty much ceased to exist ever since CPUs became fast enough to handle functional languages and brute-force recursive descent. Recursive descent is the last resort when you don't know what to do - it does exhaustive search till it nabs one "rule" that fires. Once you are content with that you kind of lose interest in researching how to do it efficiently.

What you'd essentially need is a reasonable middle ground - like LALR(2) with fixed, limited backtracking (plus a static checker to yell if the "designer" splurges into a nondeterministic tree) and also limited and partitioned symbol table feedback (modern parsers need to be concurrency-friendly).

Sounds like a research grant proposal, doesn't it :-) Now if we could find someone to actually fund it, that would be something :-))

永言不败 2024-08-23 01:17:37

A C++ compiler is very complicated. To implement enough of C++ to be compatible with most C++ code out there would take several developers a couple of years full time. clang is a compiler project being funded by Apple to develop a new compiler for C, C++, and Objective-C, with several full-time developers, and the C++ support is still very far from being complete after a couple of years of development.

走走停停 2024-08-23 01:17:37

Recursive descent is a good choice for parsing C++. GCC and clang use it.

The Elsa parser (and my ellcc compiler) use the Elkhound GLR parser generator.

In either case, writing a C++ compiler is a BIG job.

北斗星光 2024-08-23 01:17:37

Well, what do you mean by "write a compiler"?

I doubt any one person has made a true C++ compiler that took it all the way down to assembly code, but I have used lex and yacc to make a C compiler, and I have also done it without them.

Using both, you can make a compiler that leaves out the semantics in a couple of days, but figuring out how to use them can easily take weeks or months. Figuring out how to make a compiler at all will take weeks or months no matter what. The figure I remember is that, once you know how it works, it took a few days with lex and yacc and a few weeks without, but the second had better results and fewer bugs, so it's questionable whether they are worth using at all.

The 'semantics' is the actual code production. That can be very simple code that's just enough to work and might not take long at all, or you could spend your whole life doing optimization on it.
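As a rough illustration of "very simple code that's just enough to work", here is a sketch of naive code generation for a toy expression tree targeting an invented stack machine. Everything here (`Node`, `Instr`, `emit`, `run`, and the `push`/`add`/`mul` opcodes) is hypothetical and made up for the example; it is not how any particular compiler does it, and it handles only `+` and `*`.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// A toy expression tree: either a constant or a binary operation.
struct Node {
    char op;                          // '+', '*', or 0 for a constant
    long value;                       // used only when op == 0
    std::unique_ptr<Node> lhs, rhs;   // used only when op != 0
};

std::unique_ptr<Node> constant(long v) {
    return std::unique_ptr<Node>(new Node{0, v, nullptr, nullptr});
}

std::unique_ptr<Node> binary(char op, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    return std::unique_ptr<Node>(new Node{op, 0, std::move(l), std::move(r)});
}

// One instruction of the invented stack machine.
struct Instr {
    std::string opcode;   // "push", "add" or "mul"
    long operand;         // meaningful only for "push"
};

// Post-order walk: emit code for both operands, then the operator.
// Any operator other than '+' is treated as '*' in this toy.
void emit(const Node& n, std::vector<Instr>& out) {
    if (n.op == 0) {
        out.push_back({"push", n.value});
        return;
    }
    emit(*n.lhs, out);
    emit(*n.rhs, out);
    out.push_back({n.op == '+' ? "add" : "mul", 0});
}

// A tiny interpreter so the generated code can be checked.
long run(const std::vector<Instr>& code) {
    std::vector<long> stack;
    for (const Instr& in : code) {
        if (in.opcode == "push") {
            stack.push_back(in.operand);
            continue;
        }
        if (stack.size() < 2) throw std::runtime_error("stack underflow");
        long b = stack.back(); stack.pop_back();
        long a = stack.back(); stack.pop_back();
        stack.push_back(in.opcode == "add" ? a + b : a * b);
    }
    if (stack.size() != 1) throw std::runtime_error("malformed program");
    return stack.back();
}
```

For the tree `1 + 2 * 3` built with `binary('+', constant(1), binary('*', constant(2), constant(3)))`, `emit` produces `push 1, push 2, push 3, mul, add`, and `run` evaluates that to 7. The "spend your whole life" part starts when you try to make the output good rather than merely correct.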

With C++ the big issue is templates, but there are so many little issues and rules that I can't imagine someone ever wanting to do this. Even if you DO finish, you won't necessarily have binary compatibility, i.e. output that a linker or the OS recognizes as a runnable program, because there's more to it than just C++. The C++ standard is hard to pin down, and there are yet more standards to worry about which are even less widely available.

柳若烟 2024-08-23 01:17:36

There are many parsing rules that cannot be parsed by a bison/yacc parser (for example, distinguishing between a declaration and a function call in some circumstances). Additionally sometimes the interpretation of tokens requires input from the parser, particularly in C++0x. The handling of the character sequence >> for example is crucially dependent on parsing context.
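A small sketch of the `>>` problem, under stated assumptions: in C++11 (called C++0x at the time), `vector<vector<int>>` closes two template argument lists, while in `a >> b` the same two characters form a shift operator. The function `lex_angles` and its `in_template_args` flag are hypothetical; the flag stands in for the feedback a real parser would have to pass back to its lexer.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: tokenizes identifiers, single characters and ">>".
// `in_template_args` is an assumption standing in for parser feedback.
std::vector<std::string> lex_angles(const std::string& src, bool in_template_args) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;
            continue;
        }
        if (std::isalnum(c) || c == '_') {
            // Identifier or number: consume the whole run.
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            tokens.push_back(src.substr(start, i - start));
            continue;
        }
        if (c == '>' && i + 1 < src.size() && src[i + 1] == '>' && !in_template_args) {
            tokens.push_back(">>");   // outside template arguments: shift operator
            i += 2;
            continue;
        }
        tokens.push_back(std::string(1, src[i]));   // any other single character
        ++i;
    }
    return tokens;
}
```

With the flag off, `lex_angles("a >> b", false)` yields `a`, `>>`, `b`. With it on, `lex_angles("vector<vector<int>>", true)` ends in two separate `>` tokens. A standalone flex scanner has no way to know which mode it is in without the parser telling it.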

Those two tools are very poor choices for parsing C++ and you would have to put in a lot of special cases that escaped the basic framework those tools rely on in order to correctly parse C++. It would take you a long time, and even then your parser would likely have weird bugs.

yacc and bison are LALR(1) parser generators, which are not sophisticated enough to handle C++ effectively. As other people have pointed out, most C++ compilers now use a recursive descent parser, and several other answers have pointed at good solutions for writing your own.

C++ templates are no good for handling strings, even constant ones (though this may be fixed in C++0x, I haven't researched carefully), but if they were, you could pretty easily write a recursive descent parser in the C++ template language. I find that rather amusing.

燕归巢 2024-08-23 01:17:36

It sounds like you're pretty new to parsing/compiler creation. If that's the case, I'd highly recommend not starting with C++. It's a monster of a language.

Either invent a trivial toy language of your own, or do something modeled on something much smaller and simpler. I saw a Lua parser where the grammar definition was about a page long. That'd be much more reasonable as a starting point.

烂柯人 2024-08-23 01:17:36

It will probably take you years, and you'll probably switch to some other parser generator in the process.

Parsing C++ is notoriously error-prone. The grammar is not fully LR-parsable, as many parts are context-sensitive. You won't be able to get it working right in flex/yacc, or at least it'll be really awkward to implement. There are only two front-ends I know of that get it right. Your best bet is to use one of these and focus on writing the back-end. That's where the interesting stuff is anyway :-).

Existing C++ Front Ends:

  1. The EDG front-end is used by most of the commercial vendors (Intel, Portland Group, etc.) in their compilers. It costs money, but it's very thorough. People pay big bucks for it because they don't want to deal with the pain of writing their own C++ parser.

  2. GCC's C++ front-end is thorough enough for production code, but you'd have to figure out how to integrate this into your project. I believe it's fairly involved to separate it from GCC. This would also be GPL, but I'm not sure whether that's a problem for you. You can use the GCC front-end in your project via gcc_xml, but this will only give you XML for classes, functions, namespaces, and typedefs. It won't give you a syntax tree for the code.

  3. Another possibility is to use clang, but their C++ support is currently spotty. It'll be nice to see them get all the bugs out, but if you look at their C++ status page you'll notice there are more than a few test cases that still break. Take heed -- clang is a big project. If it's taking these guys years to implement a C++ front-end, it's going to take you longer.

  4. Others have mentioned ANTLR, and there is a C++ grammar available for it, but I'm skeptical. I haven't heard of an ANTLR front end being used in any major compilers, though I do believe it's used in the NetBeans IDE. It might be suitable for an IDE, but I'm skeptical that you'd be able to use it on production code.

江挽川 2024-08-23 01:17:36

A long time, and lex and yacc won't help

If you have the skills to write a compiler for such a large language, you will not need the small amount of help that lex and yacc give you. In fact, while lex is OK it may take longer to use yacc, as it's not really quite powerful enough for C or C++, and you can end up spending far more time getting it to work right than it would take to just write a recursive descent parser.
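For readers who have not seen one, here is a minimal sketch of a hand-written recursive descent parser. It handles a toy arithmetic grammar, not C++, and it evaluates as it parses instead of building a tree. The class name `ExprParser` and its grammar are assumptions made up for the example.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <utility>

// Toy grammar (illustrative only, nothing like full C++):
//   expr   := term   (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := NUMBER | '(' expr ')'
class ExprParser {
public:
    explicit ExprParser(std::string text) : text_(std::move(text)) {}

    // Parses the whole input and returns its value.
    long parse() {
        long value = expr();
        skip_spaces();
        if (pos_ != text_.size()) throw std::runtime_error("trailing input");
        return value;
    }

private:
    void skip_spaces() {
        while (pos_ < text_.size() &&
               std::isspace(static_cast<unsigned char>(text_[pos_]))) {
            ++pos_;
        }
    }

    // Consumes `c` if it is the next non-space character.
    bool accept(char c) {
        skip_spaces();
        if (pos_ < text_.size() && text_[pos_] == c) {
            ++pos_;
            return true;
        }
        return false;
    }

    long expr() {
        long value = term();
        for (;;) {
            if (accept('+')) value += term();
            else if (accept('-')) value -= term();
            else return value;
        }
    }

    long term() {
        long value = factor();
        for (;;) {
            if (accept('*')) {
                value *= factor();
            } else if (accept('/')) {
                long divisor = factor();
                if (divisor == 0) throw std::runtime_error("division by zero");
                value /= divisor;
            } else {
                return value;
            }
        }
    }

    long factor() {
        if (accept('(')) {
            long value = expr();
            if (!accept(')')) throw std::runtime_error("expected ')'");
            return value;
        }
        skip_spaces();
        if (pos_ >= text_.size() ||
            !std::isdigit(static_cast<unsigned char>(text_[pos_]))) {
            throw std::runtime_error("expected number");
        }
        long value = 0;
        while (pos_ < text_.size() &&
               std::isdigit(static_cast<unsigned char>(text_[pos_]))) {
            value = value * 10 + (text_[pos_++] - '0');
        }
        return value;
    }

    std::string text_;
    std::size_t pos_ = 0;
};
```

`ExprParser("1 + 2 * 3").parse()` returns 7 and `ExprParser("(1 + 2) * 3").parse()` returns 9. Each grammar rule is one function, so adding a special case is an ordinary `if`, which is a large part of why hand-written parsers tend to cope better with awkward languages than a table-driven tool.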

I believe lex and yacc are best used for simple grammars, or when it is worth the extra effort to have a nicely readable grammar file, perhaps because the grammar is experimental and subject to change.

For that matter, the entire parser is possibly not the major part of your job, depending on exactly what goals you have for the code generator.

不必在意 2024-08-23 01:17:36

As others have already said, yacc is a poor choice for implementing a C++ parser. One can do it; the original GCC did so, before the GCC team got disgusted with how hard it was to maintain and extend. (Flex might be OK as a lexer).

Some say recursive descent parsers are best, because Bjarne Stroustrup said so. Our experience is that GLR parsing is the right answer for this, and our GLR-based C++ front end is a nice proof, as is the Elsa front end. Our front end has been used in anger on millions of lines of C++ (including Microsoft and GCC dialects) to carry out program analyses and massive source code transformation.

But what is not emphasized enough is that parsing is just a very small portion of what it takes to build a compiler, especially for C++. You need to also build symbol tables ("what does this identifier mean in this context?") and to do that you need to encode essentially most of several hundred pages of the C++ standard. We believe that the foundation on which we build compiler-like tools, DMS, is extremely good for doing this, and it took us over a man-year to get just this part right.
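To give a feel for the "what does this identifier mean in this context?" question, here is a deliberately tiny sketch of a scoped symbol table. `ScopedSymbolTable` and `SymbolKind` are hypothetical names, and this is nothing like what DMS or any production front end does: real C++ lookup also has to deal with namespaces, classes, overloading, templates and argument-dependent lookup, which is where the hundreds of pages of the standard come in.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

enum class SymbolKind { Undeclared, Type, Variable, Function };

// Hypothetical block-scoped symbol table: a stack of name -> kind maps.
class ScopedSymbolTable {
public:
    ScopedSymbolTable() { enter_scope(); }   // start with a global scope

    void enter_scope() { scopes_.emplace_back(); }

    // The global scope is never popped.
    void leave_scope() {
        if (scopes_.size() > 1) scopes_.pop_back();
    }

    // Declares (or redeclares) a name in the innermost scope.
    void declare(const std::string& name, SymbolKind kind) {
        scopes_.back()[name] = kind;
    }

    // Searches from the innermost scope outward; inner declarations hide outer ones.
    SymbolKind lookup(const std::string& name) const {
        for (auto scope = scopes_.rbegin(); scope != scopes_.rend(); ++scope) {
            auto found = scope->find(name);
            if (found != scope->end()) return found->second;
        }
        return SymbolKind::Undeclared;
    }

private:
    std::vector<std::unordered_map<std::string, SymbolKind>> scopes_;
};
```

Even this toy shows the hiding rule: if `x` is declared as a type globally and as a variable in an inner block, `lookup("x")` reports a variable inside the block and a type again after `leave_scope()`.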

But then you have the rest of the compiler to consider:

  • Preprocessor
  • AST construction
  • Semantic analysis and type checking
  • Control, Data flow, and pointer analysis
  • Basic code generation
  • Optimizations
  • Register allocation
  • Final Code Generation
  • Debugging support

I keep saying this: building a parser (the BNF part) for a language is like climbing the foothills of the Himalayas. Building a full compiler is like climbing Everest. Pretty much any clod can do the former (although C++ is right at the edge). Only the really serious do the latter, and only when extremely well prepared.

Expect building a C++ compiler to take you years.

(The SD C++ front end handles lexing, parsing, AST generation, symbol tables, some type checking, and regeneration of compilable source text from the AST, including the original comments, for the major C++ dialects. It has been developed over a period of some 6 years).

EDIT: May, 2015. The original answer was written in 2010; we now have 11 years invested, taking us up through C++14. The point is that it is an endless, big effort to build one of these.

明媚如初 2024-08-23 01:17:36

Firstly, the "flex" tag on SO is about Adobe's product, not the lexer generator. Secondly, Bjarne Stroustrup is on record as saying he wished he had implemented Cfront (the first C++ compiler) using recursive descent rather than a table driven tool. And thirdly, to answer your question directly - lots. If you feel you need to write one, take a look at ANTLR - not my favourite tool, but there are already C++ parsers for it.

浅唱々樱花落 2024-08-23 01:17:36

This is a non-trivial problem, and would take quite a lot of time to do correctly. For one thing, the grammar for C++ is not completely parseable by an LALR parser such as yacc. You can do subsets of the language, but getting the entire language specification correct is tricky.

You're not the first person to think that this is fun. Here's a nice blog-style article on the topic:
Parsing C++

Here's an important quote from the article:

"After lots of investigation, I decided that writing a parser/analysis-tool for C++ is sufficiently difficult that it's beyond what I want to do as a hobby."

The problem with that article is that it's a bit old, and several of the links are broken. Here are some links to some other resources on the topic of writing C++ parsers:
