How to efficiently implement an immutable graph of heterogeneous immutable objects in C++?
I am writing a programming language text parser, out of curiosity. Say I want to define an immutable (at runtime) graph of tokens as vertices/nodes. These are naturally of different types - some tokens are keywords, some are identifiers, etc. However, they all share a common trait: each token in the graph points to another. This property lets the parser know what may follow a particular token - and so the graph defines the formal grammar of the language. My problem is that I stopped using C++ on a daily basis some years ago and have used a lot of higher-level languages since then, so my head is completely fragmented with regard to heap allocation, stack allocation and such. Alas, my C++ is rusty.
Still, I would like to climb the steep hill at once and set myself the goal of defining this graph in this imperative language in the most performant way. For instance, I want to avoid allocating each token object separately on the heap using 'new', because I think that if I allocate the entire graph of these tokens back to back, so to speak (in a linear fashion, like elements in an array), this would somehow benefit performance, per the locality-of-reference principle - I mean, when the entire graph is compacted to take up minimal space along a 'line' in memory, rather than having all its token objects at random locations, that is a plus? Anyway, as you can see, this is a bit of a very open question.
class token
{
};

class word : public token
{
    const char* chars;
public:
    word(const char* s) : chars(s) {}
};

class ident : public token
{
    /// haven't thought about these details yet
};

template<int N> class composite_token : public token
{
    token tokens[N]; // note: storing token by value would slice derived types
};

class graph
{
    token* p_root_token;
};
The immediate question is: what would be the procedure for creating this graph object? It's immutable and its structure is known at compile time, which is why I can and want to avoid copying things by value and so on - it should be possible to compose this graph out of literals? I hope I am making sense here... (it wouldn't be the first time I didn't.) The graph will be used by the parser at runtime as part of a compiler. And since this is C++, I would be happy with a C solution as well. Thank you very much in advance.
3 Answers
I don't think many small allocations for the tokens will be a bottleneck; if they are, you can always choose a memory pool.
Onto the problem: since all tokens have similar data (a pointer to the next token, and perhaps some enum value for which kind of token we're dealing with), you could put that similar data in one std::vector. This will be contiguous data in memory, and very efficient to loop over.
While looping, you retrieve the kind of information you need. I bet the tokens themselves would ideally only contain "actions" (member functions), such as: if the previous and next tokens are numbers, and I'm a plus sign, we should add the numbers together.
So the data is stored in one central place, the tokens are allocated (but might not actually contain much data themselves) and operate on the data at that central place. This is effectively a data-oriented design.
The vector could look like:
It's an idea.
My C++ is rusty as well, so I probably don't know the best solution for this. But since nobody else has stepped forward...
You are right that allocating all nodes in one block would give you the best locality. However, if you dynamically allocate the graph at program start, chances are that your heap allocations will also cluster together closely.
To allocate all nodes in a single memory block, two possibilities come to mind: create and populate a std::vector at startup (with the drawback that you then have the graph information twice in memory), or use a static array initializer, Node graph[] = { ... };.
For either approach, the biggest obstacle is that you want to create a graph of heterogeneous objects. One obvious solution is "don't": you could make your node a superset of all possible fields and distinguish the types with an explicit 'type' member.
If you want to keep the various node classes, you will have to use multiple arrays/vectors: one for each type.
Either way, the connections between the nodes will initially have to be defined in terms of array indices (node 3 is followed by node 10). For better parsing performance, you could of course create direct object pointers at program startup based on these indices.
I would not put literal strings into any node ('word' in your case): the recognition of keywords, identifiers and other lexical elements should be done in a lexer module separate from the parser. I think it would also help to distinguish, in terminology, between the tokens generated by the lexer from the program's input and the grammar-graph nodes your program uses to parse that input.
I hope this helps.
I don't see how you will define a "graph" of tokens that defines the syntax of any practical programming language, especially if the relation between tokens is "allowed-to-follow".
The usual way to represent the grammar of a programming language is Backus-Naur Form (BNF), or an extended version of it termed "EBNF".
If you wanted to represent EBNF "as an immutable graph", this SO answer discusses how to do that in C#. The ideas have direct analogs in C++.
The bad news is that most parsing engines can't use EBNF directly, because it is simply too inefficient in practice. It is hard to build an efficient parser from a direct representation of the grammar rules; this is why people invented parser generators. So unless you intend to write a parser generator, it is unclear that you need to put these rules into a memory structure at all, let alone an "efficient" one.
Finally, even if you do somehow pack the grammar information optimally, it probably won't make an ounce of difference in actual performance. Most of a parser's time is spent grouping characters into lexemes, sometimes even to the point of just doing blank suppression.