Defined C token file for flex?
I want to split a C file into tokens, not for compiling but for analysis. I feel like this should be pretty straightforward, and I tried looking online for a predefined tokens.l (or something similar) file for flex with all of the C grammar already defined, but couldn't find anything. Are there any ready-made grammars floating around, or am I perhaps going about this all wrong?
Comments (1)
Yes, there's at least one around.
Edit:
Since there are a few issues that grammar doesn't handle, perhaps it's worth looking at some (hand-written) lexing code I wrote several years ago. It basically handles only phases 1, 2 and 3 of translation. If you define DIGRAPH, it also turns on some code to translate C++ digraphs. If memory serves, however, it does that earlier in translation than it should really happen, but you probably don't want it in any case. On the one hand, this doesn't even attempt to recognize anywhere close to all tokens; mostly it separates the source into comments, character literals, string literals, and pretty much everything else. On the other hand, it does handle trigraphs, line splicing, etc.
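To make "phases 1 and 2" concrete, here is a minimal sketch of trigraph replacement followed by line splicing on an in-memory buffer. It is not the code referred to above; the function names `trigraph_replacement` and `preprocess_phases_1_2` are hypothetical, and working on a whole NUL-terminated buffer in place (rather than character by character from a file) is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map the third character of a three-character trigraph sequence to its
   replacement, or return 0 if that character does not complete a trigraph. */
static char trigraph_replacement(char c)
{
    switch (c) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Hypothetical helper: apply translation phase 1 (trigraphs) and then
   phase 2 (line splicing) to a NUL-terminated buffer, in place.
   Returns the new length of the text. */
size_t preprocess_phases_1_2(char *buf)
{
    assert(buf != NULL);
    size_t n = strlen(buf);
    size_t out = 0;

    /* Phase 1: replace each trigraph with its single-character equivalent. */
    for (size_t i = 0; i < n; ) {
        char r = 0;
        if (i + 2 < n && buf[i] == '?' && buf[i + 1] == '?')
            r = trigraph_replacement(buf[i + 2]);
        if (r != 0) {
            buf[out++] = r;
            i += 3;
        } else {
            buf[out++] = buf[i++];
        }
    }
    n = out;
    out = 0;

    /* Phase 2: delete every backslash that is immediately followed by a
       new-line, joining the two physical lines into one logical line. */
    for (size_t i = 0; i < n; ) {
        if (buf[i] == '\\' && i + 1 < n && buf[i + 1] == '\n')
            i += 2;
        else
            buf[out++] = buf[i++];
    }
    buf[out] = '\0';
    return out;
}
```

Running the two phases in this order matters: a trigraph that turns into a backslash right before a new-line has to be spliced as well, which only works if the trigraph is replaced first.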
I suppose I should also add that my code leaves conversion of the platform's line-ending sequence into a new-line to the underlying implementation, by opening the file in translated (text) mode. Under most circumstances, that's probably the right thing to do, but if you want to produce something like a cross-compiler, where your source files have a different line-ending sequence than is normal for the host, you might have to change that.
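If you do need to handle foreign line endings yourself, one option is to open the file in binary mode (`"rb"`) and normalize the bytes before lexing. The following is only a sketch under assumptions: the name `normalize_newlines` is hypothetical, and treating a lone CR as a line ending (in addition to CR LF) is a choice you may not want.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rewrite CR LF pairs and lone CR characters as a
   single '\n', in place. `len` is the number of bytes in `buf`; the
   return value is the number of bytes left after normalization.
   The buffer is treated as raw bytes and is not NUL-terminated here. */
size_t normalize_newlines(char *buf, size_t len)
{
    assert(buf != NULL);
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\r') {
            /* For CR LF, also consume the LF so only one '\n' is emitted. */
            if (i + 1 < len && buf[i + 1] == '\n')
                i++;
            buf[out++] = '\n';
        } else {
            buf[out++] = buf[i];
        }
    }
    return out;
}
```

This would run before trigraph replacement and line splicing, so that later phases only ever see `'\n'` as the end of a line.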
First the header that defines the external interface to all this stuff:
And then the implementation of all that:
I'm not sure how easy or difficult it would be to integrate that into a Flex-based lexer, though. I seem to recall that Flex has some sort of hook to define what it uses to read a character, but I've never tried to use it, so I can't say much more about it (and ultimately, can't even say with anything approaching certainty that it exists).
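For what it's worth, the Flex manual does document a `YY_INPUT(buf, result, max_size)` macro that the generated scanner uses to refill its buffer, and redefining it is the usual way to feed Flex from a custom source. Below is an untested-with-Flex sketch of how a pre-filtering reader might plug into it. Everything other than `YY_INPUT` itself (`set_source`, `next_filtered_char`, `fill_buffer`, and the string-based input) is hypothetical, and the filter only does line splicing to keep things short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical input source: a cursor into a NUL-terminated string.
   A real lexer would read from a FILE* instead. */
static const char *src_cursor = "";

void set_source(const char *text)
{
    assert(text != NULL);
    src_cursor = text;
}

/* Return the next character after line splicing, or -1 at end of input. */
static int next_filtered_char(void)
{
    while (src_cursor[0] == '\\' && src_cursor[1] == '\n')
        src_cursor += 2;
    if (*src_cursor == '\0')
        return -1;
    return (unsigned char)*src_cursor++;
}

/* Copy up to max_size filtered characters into buf and return how many
   were stored; 0 means end of input. */
size_t fill_buffer(char *buf, size_t max_size)
{
    size_t n = 0;
    int c;
    while (n < max_size && (c = next_filtered_char()) != -1)
        buf[n++] = (char)c;
    return n;
}

/* In a .l file this definition would go inside the %{ ... %} part of the
   definitions section. A result of 0 is what Flex treats as end of file. */
#define YY_INPUT(buf, result, max_size) \
    { (result) = fill_buffer((buf), (size_t)(max_size)); }
```

With this in place the scanner's rules never see a backslash-new-line pair, so token patterns don't have to account for splices in the middle of identifiers or literals.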