Is there a better (more modern) tool than lex/flex for generating a tokenizer for C++?

Posted on 2024-08-19 21:51:29

I recently added source file parsing to an existing tool that generates output files from complex command line arguments.

The command line arguments got to be so complex that we started allowing them to be supplied as a file that was parsed as if it was a very large command line, but the syntax was still awkward. So I added the ability to parse a source file using a more reasonable syntax.

I used Flex 2.5.4 for Windows to generate the tokenizer for this custom source file format, and it worked. But I hated the code: global variables, a weird naming convention, and the C++ code it generated was awful. The existing code generation backend was glued to the output of Flex; I don't use yacc or bison.

I'm about to dive back into that code, and I'd like to use a better, more modern tool. Does anyone know of something that:

  • Runs in a Windows command prompt (Visual Studio integration is OK, but I use makefiles to build)
  • Generates a properly encapsulated C++ tokenizer (no global variables)
  • Uses regular expressions for describing the tokenizing rules (compatibility with lex syntax is a plus)
  • Does not force me to use the C runtime (or fake it) for file reading (parse from memory)
  • Warns me when my rules force the tokenizer to backtrack (or fixes it automatically)
  • Gives me full control over variable and method names (so I can conform to my existing naming convention)
  • Allows me to link multiple parsers into a single .exe without name collisions
  • Can generate a Unicode (16-bit UCS-2) parser if I want it to
  • Is NOT an integrated tokenizer + parser-generator (I want a lex replacement, not a lex+yacc replacement)

I could probably live with a tool that just generated the tokenizing tables if that was the only thing available.
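To make the wish list above concrete, here is a minimal hand-written sketch of the kind of interface being asked for: all state inside an object, input taken from a memory range, and the character type left as a template parameter so a 16-bit variant is possible. The names `MemoryTokenizer`, `Token` and `TokenKind` are hypothetical, and the token rules are deliberately trivial; this is not the output of any generator.

```cpp
#include <cassert>
#include <string>

// Hypothetical token kinds, for illustration only.
enum class TokenKind { Identifier, Number, Punct, End };

template <typename CharT>
struct Token {
    TokenKind kind;
    std::basic_string<CharT> text;
};

// All scanner state lives in the object, so several instances (or several
// different tokenizers) can coexist in one executable without global names.
template <typename CharT>
class MemoryTokenizer {
public:
    // Scans the half-open range [begin, end); no file or C runtime I/O involved.
    MemoryTokenizer(const CharT* begin, const CharT* end)
        : cur_(begin), end_(end) { assert(begin <= end); }

    Token<CharT> next() {
        while (cur_ != end_ && isSpace(*cur_)) ++cur_;    // skip whitespace
        if (cur_ == end_) return {TokenKind::End, std::basic_string<CharT>()};
        const CharT* start = cur_;
        if (isAlpha(*cur_)) {
            while (cur_ != end_ && (isAlpha(*cur_) || isDigit(*cur_))) ++cur_;
            return {TokenKind::Identifier, std::basic_string<CharT>(start, cur_)};
        }
        if (isDigit(*cur_)) {
            while (cur_ != end_ && isDigit(*cur_)) ++cur_;
            return {TokenKind::Number, std::basic_string<CharT>(start, cur_)};
        }
        ++cur_;                                           // any other single character
        return {TokenKind::Punct, std::basic_string<CharT>(start, cur_)};
    }

private:
    // ASCII-only classification, written by hand so it works for any CharT.
    static bool isSpace(CharT c) { return c == ' ' || c == '\t' || c == '\n' || c == '\r'; }
    static bool isDigit(CharT c) { return c >= '0' && c <= '9'; }
    static bool isAlpha(CharT c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    const CharT* cur_;
    const CharT* end_;
};
```

Instantiating `MemoryTokenizer<char>` and `MemoryTokenizer<char16_t>` side by side gives two independent scanners with no shared symbols, which is roughly what the "no name collisions" and "16-bit" items are asking a generator to produce.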

Comments (5)

浅唱々樱花落 2024-08-26 21:51:29

Ragel (http://www.complang.org/ragel/) fits most of your requirements.

  • It runs on Windows.
  • It doesn't declare the variables, so you can put them inside a class or inside a function as you like.
  • It has nice tools for analyzing regular expressions to see when they would backtrack. (I don't know much about this, since I never use syntax in Ragel that would create a backtracking parser.)
  • Variable names can't be changed.
  • Table names are prefixed with the machine name, and they're declared "const static", so you can put more than one in the same file and have more than one with the same name in a single program (as long as they're in different files).
  • You can declare the variables as any integer type, including UChar (or whatever UTF-16 type you prefer). It doesn't automatically handle surrogate pairs, though. It doesn't have special character classes for Unicode either (I think).
  • It only does regular expressions; it has no bison/yacc features.

The code it generates interferes very little with a program. The code is also incredibly fast, and the Ragel syntax is more flexible and readable than anything I've ever seen. It's a rock solid piece of software. It can generate a table-driven parser or a goto-driven parser.
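For readers who have not seen table-driven scanner code, the following hand-written sketch mirrors the shape described above: `static const` tables prefixed with a machine name, and the current-state variable declared by the user inside a class. It is not actual Ragel output. The machine name `intlit`, the class `IntLiteralMatcher`, and the tiny signed-integer grammar are all assumptions made for illustration.

```cpp
#include <cassert>

// Tables carry a machine-name prefix and internal linkage ("static const"),
// so another source file may define tables with exactly the same names.
static const int intlit_error       = -1;
static const int intlit_start       = 0;
static const int intlit_first_final = 2;

// Rows: state. Columns: character class (0 = sign, 1 = digit, 2 = other).
static const signed char intlit_trans[3][3] = {
    /* 0: start      */ {  1, 2, -1 },
    /* 1: after sign */ { -1, 2, -1 },
    /* 2: in digits  */ { -1, 2, -1 },
};

static int intlit_class(char c) {
    if (c == '+' || c == '-') return 0;
    if (c >= '0' && c <= '9') return 1;
    return 2;
}

// The state variable belongs to the user, here a class member.
class IntLiteralMatcher {
public:
    // Feeds one chunk of input; may be called repeatedly with further chunks.
    // Returns false once the machine has entered the error state.
    bool feed(const char* p, const char* pe) {
        assert(p <= pe);
        for (; p != pe && cs_ != intlit_error; ++p)
            cs_ = intlit_trans[cs_][intlit_class(*p)];
        return cs_ != intlit_error;
    }

    // True if the input seen so far forms a complete signed integer.
    bool accepted() const { return cs_ >= intlit_first_final; }

private:
    int cs_ = intlit_start;
};
```

Because the only mutable state is `cs_`, input can arrive in pieces (for example `"-1"` followed by `"23"`), and any number of matchers can run at once.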

御弟哥哥 2024-08-26 21:51:29

Flex also has a C++ output option.
The result is a set of classes that do the parsing.

Just add the following to the head of your lex file:

%option c++
%option yyclass="Lexer"

Then in your source it is:

// Lexer is the class named by yyclass; you define it yourself, derived from yyFlexLexer.
std::ifstream file("config");
Lexer         lexer(&file);
while (int token = lexer.yylex())
{
    // handle token
}
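Since the snippet above hands the lexer a stream pointer, one way to meet the question's "parse from memory" requirement is to pass a `std::istringstream` instead of a file stream. The sketch below assumes your Flex version's lexer class accepts a `std::istream*`. `StandInLexer` and `countTokens` are hypothetical names; the stand-in is not Flex output and only imitates the calling shape (construct from a stream pointer, `yylex()` returns 0 at end of input).

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Stand-in for a generated lexer class: same calling shape, trivial rules.
class StandInLexer {
public:
    explicit StandInLexer(std::istream* in) : in_(in) { assert(in != nullptr); }

    // Returns 1 for each whitespace-separated word, 0 when input is exhausted.
    int yylex() { return (*in_ >> text_) ? 1 : 0; }

    const std::string& text() const { return text_; }

private:
    std::istream* in_;
    std::string   text_;
};

// Counts tokens in a buffer that is already in memory; no file is opened.
int countTokens(const std::string& buffer) {
    std::istringstream memory(buffer);   // wraps a copy of the buffer as a stream
    StandInLexer lexer(&memory);
    int count = 0;
    while (lexer.yylex() != 0) ++count;
    return count;
}
```

Note that `std::istringstream` copies the buffer; if the copy matters, a custom `std::streambuf` over the existing memory would be the next thing to try.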
感性不性感 2024-08-26 21:51:29

Boost.Spirit.Qi (parser-tokenizer) or Boost.Spirit.Lex (tokenizer only). I absolutely love Qi, and Lex is not bad either, but I just tend to take Qi for my parsing needs...

The only real drawback with Qi tends to be an increase in compile time, and it also runs slightly slower than hand-written parsing code. It is generally much faster than parsing with regex, though.

http://www.boost.org/doc/libs/1_41_0/libs/spirit/doc/html/index.html

被你宠の有点坏 2024-08-26 21:51:29

Two tools come to mind, Antlr and GoldParser, although you would need to find out for yourself which one is suitable. Both have language bindings that let them be plugged into a C++ runtime environment.

寻梦旅人 2024-08-26 21:51:29

Boost.Spirit and the Yard parser come to mind. Note that lexer generators are being somewhat replaced by internal C++ DSLs (domain-specific languages) for specifying tokens, simply because the rules are then part of your code: there is no external utility, and you specify your grammar just by following a set of conventions.
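As a rough illustration of the internal-DSL idea, here is a hand-rolled sketch in which each token rule is a C++ type composed from a few templates. It is not the API of Yard or Boost.Spirit. `CharRange`, `Or`, `Seq`, `Star` and `matchLength` are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>

// Each rule is a type with a static match() that advances p on success
// and leaves it unchanged on failure.
template <char Lo, char Hi>
struct CharRange {
    static bool match(const char*& p, const char* end) {
        if (p != end && *p >= Lo && *p <= Hi) { ++p; return true; }
        return false;
    }
};

template <typename A, typename B>
struct Or {
    static bool match(const char*& p, const char* end) {
        return A::match(p, end) || B::match(p, end);
    }
};

template <typename A, typename B>
struct Seq {
    static bool match(const char*& p, const char* end) {
        const char* saved = p;
        if (A::match(p, end) && B::match(p, end)) return true;
        p = saved;                       // undo partial progress
        return false;
    }
};

// Zero or more repetitions; R must consume input whenever it succeeds.
template <typename R>
struct Star {
    static bool match(const char*& p, const char* end) {
        while (R::match(p, end)) {}
        return true;
    }
};

// The token rules themselves are ordinary type aliases.
using Digit  = CharRange<'0', '9'>;
using Letter = Or<CharRange<'a', 'z'>, CharRange<'A', 'Z'>>;
using Ident  = Seq<Letter, Star<Or<Letter, Digit>>>;
using Number = Seq<Digit, Star<Digit>>;

// Returns how many characters Rule consumes at the start of [p, end), or 0.
template <typename Rule>
std::size_t matchLength(const char* p, const char* end) {
    assert(p <= end);
    const char* cur = p;
    return Rule::match(cur, end) ? static_cast<std::size_t>(cur - p) : 0;
}
```

The grammar is checked by the compiler and needs no separate build step, at the cost of longer compile times as the rule set grows.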
