How to efficiently implement an immutable graph of heterogeneous immutable objects in C++?

Posted on 2024-09-29 09:54:05

I am writing a programming language text parser, out of curiosity. Say I want to define an immutable (at runtime) graph with tokens as its vertices/nodes. These are naturally of different types: some tokens are keywords, some are identifiers, etc. However, they all share a common trait: each token in the graph points to another. This property lets the parser know what may follow a particular token, and so the graph defines the formal grammar of the language. My problem is that I stopped using C++ on a daily basis some years ago and have used a lot of higher-level languages since then, so my understanding of heap allocation, stack allocation and the like is completely fragmented. Alas, my C++ is rusty.

Still, I would like to climb the steep hill at once, and I have set myself the goal of defining this graph in this imperative language in the most performant way. For instance, I want to avoid allocating each token object separately on the heap using 'new'. I think that if I allocate the entire graph of tokens back to back (in a linear fashion, like elements in an array), this would benefit performance, per the locality of reference principle. That is, having the entire graph compacted to take up minimal space along a 'line' in memory, rather than having its token objects at random locations, should be a plus? Anyway, as you can see, this is a very open question.

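class token
{
};

class word : public token
{
public:
    const char* chars;

    word(const char* s) : chars(s)
    {
    }
};

class ident : public token
{
    // haven't thought about these details yet
};

template<int N> class composite_token : public token
{
public:
    // pointers, not values: an array of plain `token` would slice derived objects
    const token* tokens[N];
};

class graph
{
public:
    const token* p_root_token;
};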

The immediate question is: what would be the procedure to create this graph object? It's immutable and its intended structure is known at compile time, which is why I can and want to avoid copying things by value and so on. It should be possible to compose this graph out of literals? I hope I am making sense here... (It wouldn't be the first time I didn't.) The graph will be used by the parser at runtime as part of a compiler. And although this is C++, I would be happy with a C solution as well. Thank you very much in advance.
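One possible reading of "composing the graph out of literals" is a constexpr table. The sketch below is only an illustration under stated assumptions: it assumes C++17, replaces the class hierarchy with a single flat `Node` struct tagged by a `Kind` enum, and links nodes by array index rather than by pointer. The names `Kind`, `Node`, `nodes`, `edges` and `may_follow`, as well as the toy grammar itself, are hypothetical and not part of the question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical node kinds; the question only mentions keywords and identifiers.
enum class Kind { Word, Ident, End };

// One flat node type instead of a class hierarchy. The successors of a node
// are the entries edges[first_edge] .. edges[first_edge + edge_count - 1].
struct Node {
    Kind kind;
    std::string_view text;   // literal text for Word nodes, empty otherwise
    std::size_t first_edge;
    std::size_t edge_count;
};

// Toy grammar (an assumption for illustration): "let" ident [ "=" ident ] ";"
inline constexpr std::array<Node, 6> nodes{{
    {Kind::Word,  "let", 0, 1},  // 0: followed by ident
    {Kind::Ident, "",    1, 2},  // 1: followed by "=" or ";"
    {Kind::Word,  "=",   3, 1},  // 2: followed by ident
    {Kind::Ident, "",    4, 1},  // 3: followed by ";"
    {Kind::Word,  ";",   5, 1},  // 4: followed by End
    {Kind::End,   "",    6, 0},  // 5: no successors
}};

// Successor indices for all nodes, stored back to back.
inline constexpr std::array<std::size_t, 6> edges{1, 2, 4, 3, 4, 5};

// True if node `to` is listed as a direct successor of node `from`.
constexpr bool may_follow(std::size_t from, std::size_t to) {
    const Node& n = nodes[from];
    for (std::size_t i = 0; i < n.edge_count; ++i) {
        if (edges[n.first_edge + i] == to) return true;
    }
    return false;
}

// The table is a constant expression, so it can be checked at compile time.
static_assert(may_follow(1, 4), "an identifier may be followed by ';'");
static_assert(!may_follow(0, 2), "'let' may not be followed by '='");
```

Both arrays are constant-initialized, so no heap allocation or copying is involved in building the graph, and all nodes sit contiguously in memory. Whether this actually helps performance for a grammar of realistic size would need measuring.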

蘑菇王子 2024-10-06 09:54:06

I don't think many small allocations of tokens will be a bottleneck; if they are, you can always use a memory pool.
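To make the memory pool remark concrete, here is a minimal sketch of a fixed-size bump allocator. It is an assumption about what such a pool might look like, not something taken from this answer: `Arena`, `Token` and `build_chain` are hypothetical names, the pool never frees individual objects, and it only accepts trivially destructible types.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

// Hypothetical bump allocator: hands out slices of one fixed buffer.
template <std::size_t Capacity>
class Arena {
public:
    // Constructs a T in the next suitably aligned slot; returns nullptr when full.
    template <class T, class... Args>
    T* create(Args&&... args) {
        static_assert(std::is_trivially_destructible<T>::value,
                      "this sketch never runs destructors");
        static_assert(alignof(T) <= alignof(std::max_align_t),
                      "over-aligned types are not supported here");
        // Round the current offset up to a multiple of alignof(T).
        const std::size_t start = (used_ + alignof(T) - 1) / alignof(T) * alignof(T);
        if (start + sizeof(T) > Capacity) return nullptr;
        used_ = start + sizeof(T);
        return ::new (static_cast<void*>(buffer_ + start)) T{std::forward<Args>(args)...};
    }

    std::size_t used() const { return used_; }

private:
    alignas(std::max_align_t) unsigned char buffer_[Capacity];
    std::size_t used_ = 0;
};

// Hypothetical token: an id plus a pointer to the token that may follow it.
struct Token {
    int id;
    const Token* next;
};

// Builds a chain 0 -> 1 -> ... -> count-1 inside the arena and returns its head.
// Tokens are created back to front so each one can point at its successor.
inline const Token* build_chain(Arena<256>& arena, int count) {
    const Token* next = nullptr;
    for (int id = count - 1; id >= 0; --id) {
        Token* t = arena.create<Token>(id, next);
        if (t == nullptr) return nullptr;  // arena exhausted
        next = t;
    }
    return next;
}
```

All tokens end up adjacent in one buffer, which is the layout the question asks about, and everything is released at once when the arena goes out of scope.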

On to the problem: since all tokens have similar data (a pointer to the next token, and perhaps some enum value for the kind of token we're dealing with), you could put the similar data in one std::vector. This will be contiguous in memory, and very efficient to loop over.

While looping, you retrieve the kind of information you need. I bet the tokens themselves would ideally only contain "actions" (member functions), such as: if the previous and next tokens are numbers, and I'm a plus sign, we should add the numbers together.

So, the data is stored in one central place; the tokens are allocated (but might not actually contain much data themselves) and work on the data in the central place. This is actually a data-oriented design.

The vector could look like:

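#include <cstddef>
#include <vector>

class token;

enum token_id { NUMBER, PLUS }; // some enum? (placeholder values)

struct TokenData
{
    token *previous, *current, *next;
    token_id id;
    // ... more data that is similar
};

std::vector<TokenData> token_data;

class token
{
public:
    std::vector<TokenData> *token_data; // points at the central vector
    std::size_t index;                  // this token's slot in that vector

    virtual ~token() = default;
    virtual void do_work() = 0;         // the "action" of this token

    TokenData &data()
    {
        return (*token_data)[index];
    }

    const TokenData &data() const
    {
        return (*token_data)[index];
    }
};

// class plus_sign: public token
// if (data().previous->data().id == NUMBER && data().next->data().id == NUMBER)

void process_all() // hypothetical wrapper so the loop compiles
{
    for (std::size_t i = 0; i < token_data.size(); i++)
    {
        token_data[i].current->do_work();
    }
}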

It's an idea.
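Taking the data-oriented idea one step further, the per-token objects can be dropped entirely and neighbours found by index. The following is a minimal sketch under that assumption, not the design above: `TokenId`, `TokenRecord`, `is_addition` and `evaluate_sum` are hypothetical names, and the only "action" implemented is the plus-sign rule mentioned earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class TokenId { Number, Plus };

// Hypothetical flat record; neighbours are simply index - 1 and index + 1.
struct TokenRecord {
    TokenId id;
    int value;  // meaningful only when id == TokenId::Number
};

// True if tokens[i] is a plus sign with a number on each side.
inline bool is_addition(const std::vector<TokenRecord>& tokens, std::size_t i) {
    return i > 0 && i + 1 < tokens.size() &&
           tokens[i].id == TokenId::Plus &&
           tokens[i - 1].id == TokenId::Number &&
           tokens[i + 1].id == TokenId::Number;
}

// Evaluates a flat "n + n + ..." sequence left to right.
// Returns 0 if the sequence does not start with a number.
inline int evaluate_sum(const std::vector<TokenRecord>& tokens) {
    if (tokens.empty() || tokens[0].id != TokenId::Number) return 0;
    int total = tokens[0].value;
    for (std::size_t i = 1; i < tokens.size(); ++i) {
        if (is_addition(tokens, i)) total += tokens[i + 1].value;
    }
    return total;
}
```

With the records for `1 + 2 + 3` in the vector, `evaluate_sum` walks a single contiguous array and needs no virtual calls. A real parser would need more token kinds and error handling than this sketch shows.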
