Is there any free parser generator that generates C++ and correctly handles Unicode?

Posted 2024-10-05 10:09:10

After asking this question, I'm now sold on trying a parser generator, where before I was going to write everything by hand.

However, I can't seem to find a parser generator that generates C++ code, nor one that correctly handles Unicode. (Note that my input is in UCS-2; I don't care about supporting anything outside the Basic Multilingual Plane if that makes building the parser more difficult.)

There are some parser generators that emit C, but they all seem to throw exception safety out the window, which would prevent me from using C++ inside any semantic actions.

Does a parser generator exist that meets these two requirements, or am I stuck doing everything by hand?

EDIT: Oh, and my project is BSL licensed, so there can't be many restrictions on use of the output of the parser generator itself.

Comments (8)

一杯敬自由 2024-10-12 10:09:10

There are two ways to do this in C++: use a program that generates C++ files from a grammar written in free form, or write the grammar directly with templates.

If you write the grammar as template types, you again have two choices. One is boost::proto, where every operator is redefined to build a syntax tree in boost::fusion (used in boost::spirit, boost::msm, boost::xpressive); the basic idea is described under Expression Templates. The other is to build the expression tree by hand with the help of your own templates and store it directly in boost::mpl containers. This technique is used in Biscuit.

In Biscuit you have

or_<>, seq_<>, char_<>, ..

templates. Biscuit is based on YARD, but extended with boost::range to get better submatch capability.
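To make the template-type approach more concrete, here is a minimal sketch of grammar combinators written as class templates. The names `char_`, `seq_`, `or_`, and `star_` only mirror the style mentioned above; the definitions, the `sketch` namespace, and the `parse_all` helper are assumptions made for illustration and are not Biscuit's or YARD's actual code. The sketch works on `char16_t` code units (matching UCS-2 input) and needs C++17 for fold expressions.

```cpp
#include <string>

namespace sketch {  // hypothetical namespace, not part of any library

using Iter = const char16_t*;

// Matches exactly one code unit equal to C.
template <char16_t C>
struct char_ {
    static bool match(Iter& it, Iter end) {
        if (it != end && *it == C) { ++it; return true; }
        return false;
    }
};

// Matches every rule in order; restores the position on failure.
template <typename... Rules>
struct seq_ {
    static bool match(Iter& it, Iter end) {
        Iter save = it;
        if ((Rules::match(it, end) && ...)) return true;
        it = save;
        return false;
    }
};

// Tries each alternative in order; the first success wins.
// Relies on every rule leaving the position unchanged when it fails.
template <typename... Rules>
struct or_ {
    static bool match(Iter& it, Iter end) {
        return (Rules::match(it, end) || ...);
    }
};

// Matches zero or more repetitions of Rule; stops if Rule makes no progress.
template <typename Rule>
struct star_ {
    static bool match(Iter& it, Iter end) {
        while (it != end) {
            Iter before = it;
            if (!Rule::match(it, end) || it == before) break;
        }
        return true;
    }
};

// The grammar is a type: '[' followed by any run of 'a' or U+03B1, then ']'.
using letter = or_<char_<u'a'>, char_<u'\u03B1'>>;
using bracketed = seq_<char_<u'['>, star_<letter>, char_<u']'>>;

// Returns true only if Grammar consumes the whole input.
template <typename Grammar>
bool parse_all(const std::u16string& s) {
    Iter it = s.data();
    Iter end = s.data() + s.size();
    return Grammar::match(it, end) && it == end;
}

}  // namespace sketch
```

With these definitions, `sketch::parse_all<sketch::bracketed>(u"[a\u03B1a]")` accepts the input and `u"[ab]"` is rejected. Because the matchers are ordinary C++ functions with no hidden global state, a semantic action added to them could throw without leaving anything to clean up.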

The Biscuit Parser Library 1

The Biscuit Parser Library 2

Yet Another Recursive Descent (YARD) parsing framework for C++

定格我的天空 2024-10-12 10:09:10

All right, this might be a long shot, but there is an LALR parser generator that is a side project of Qt, called QLALR. It is a really thin layer: the lexing is still up to you, but all the work can be done through QStrings, which support Unicode. There is not a lot of functionality to it. You write the grammar together with the code that does the work for each token, and it generates the parser for you. I have used it successfully to generate a parser with ~100 rules that builds an AST of the parsed language.

不可一世的女人 2024-10-12 10:09:10

ANTLR has Unicode support. It has C++ (and C, Java and a few other languages) support, though I've never used the C++ support so I'm not sure how well developed it is.

梦冥 2024-10-12 10:09:10

There appears to be preliminary support for Unicode in boost::spirit.

帅的被狗咬 2024-10-12 10:09:10

If you're in the mood to experiment, this one supports wide chars but is obscure: http://wiki.winprog.org/wiki/LibCC_Parse

我不咬妳我踢妳 2024-10-12 10:09:10

The parser doesn't care about characters since it processes tokens.

Lexing Unicode is very expensive. This is because you either pay a huge function-call overhead for classification, or you kill your memory with massive tables. Normally you'd only support Unicode in specific places in a programming language, such as string literals and perhaps identifiers, where a handcrafted function can do the job efficiently.

I once coded in Ocamllex a lexer that would accept the identifiers mandated by the ISO C++ standard (which includes a set of ranges of Unicode code points considered as "letters" in various languages). Although the number of code point ranges is quite small (around 20 or so ranges), the UTF-8 DFA for this has over 64K states and blew up the lexer generator :)

My advice here is: you will have to hand craft your lexer. It is, in fact, very easy to do this inefficiently. Doing it efficiently is very much harder: I'd be looking at Judy arrays for support (this is the fastest data structure on the planet).
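As a rough illustration of the "handcrafted function" idea for identifiers, the sketch below classifies a UCS-2 code unit with a binary search over a small sorted range table, which avoids both a giant DFA and a full 64K lookup table. The range list is an illustrative subset chosen for this example (some entries are whole Unicode blocks and include non-letters); it is not the list from the ISO C++ standard, and the names `Range`, `kLetterRanges`, `is_identifier_letter`, and `scan_identifier` are hypothetical.

```cpp
#include <algorithm>
#include <iterator>

namespace sketch {  // hypothetical namespace, not part of any library

struct Range { char16_t first, last; };  // inclusive bounds

// Illustrative subset only: sorted by `first`, non-overlapping.
// NOT the identifier ranges mandated by the ISO C++ standard.
constexpr Range kLetterRanges[] = {
    {u'A', u'Z'},
    {u'_', u'_'},
    {u'a', u'z'},
    {0x00C0, 0x00D6},  // Latin-1 letters before the multiplication sign
    {0x00D8, 0x00F6},  // Latin-1 letters between U+00D7 and U+00F7
    {0x00F8, 0x00FF},  // Latin-1 letters after the division sign
    {0x0370, 0x03FF},  // Greek and Coptic block (taken whole)
    {0x0400, 0x04FF},  // Cyrillic block (taken whole)
    {0x4E00, 0x9FFF},  // CJK Unified Ideographs block (taken whole)
};

// Binary search: find the last range starting at or before c,
// then check whether c falls inside it.
inline bool is_identifier_letter(char16_t c) {
    auto it = std::upper_bound(
        std::begin(kLetterRanges), std::end(kLetterRanges), c,
        [](char16_t value, const Range& r) { return value < r.first; });
    if (it == std::begin(kLetterRanges)) return false;
    --it;
    return c <= it->last;
}

// Returns a pointer just past the run of identifier letters starting at p.
inline const char16_t* scan_identifier(const char16_t* p, const char16_t* end) {
    while (p != end && is_identifier_letter(*p)) ++p;
    return p;
}

}  // namespace sketch
```

The cost per code unit is a handful of comparisons, and the table stays tiny even with a few hundred ranges. A hand-written lexer could call `scan_identifier` only at the points where an identifier may start and handle everything else with plain ASCII comparisons.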

西瓜 2024-10-12 10:09:10

Try Boost.Spirit. You can plug in your own "stream decoder", which handles the Unicode part of your problem. Making Spirit work with wchar_t should be possible, although I have not tried it myself.

dawn曙光 2024-10-12 10:09:10

I don't know a whole lot of theory about parsers, so forgive me if this doesn't fit the bill, but there is Ragel.

Ragel generates state machines. It's (perhaps most famously?) used by the Mongrel HTTP server for Ruby to parse HTTP requests.

Ragel targets plain C (amongst others), but all of the state machine data is either static const or stack allocated, so that should alleviate some important concerns with C++ exceptions. If special exception handling is required, Ragel doesn't shy away from exposing its internals. (Not as complex as it may sound.)

Unicode should be possible, because input is an array of any basic type, usually char, but probably short or int in your case. If that doesn't do, you can even replace the array iteration with your own mechanism for getting the next input item/token/event.
