Unique synthesised names

Posted 2024-10-09 13:39:44

I would like to generate various data types in C++ with unique deterministic names. For example:

struct struct_int_double { int mem0; double mem1; };

At present my compiler synthesises names using a counter, which means the names don't agree when compiling the same data type in distinct translation units.
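To make the mismatch concrete, here is a minimal sketch of counter-based naming. It assumes each translation unit runs its own namer and hands out names in the order types are first seen. `CounterNamer`, `name_after`, the `t<N>` name format and the structure strings are all made up for illustration; this is not the compiler's real code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical counter-based namer; one instance per translation unit.
class CounterNamer {
public:
    // Returns the synthesized name for a structural type description,
    // allocating the next counter value the first time it is seen.
    std::string name_for(const std::string& structure) {
        assert(!structure.empty());
        auto it = names_.find(structure);
        if (it == names_.end()) {
            it = names_.emplace(structure, "t" + std::to_string(next_++)).first;
        }
        return it->second;
    }

private:
    std::map<std::string, std::string> names_;
    int next_ = 0;
};

// Name that `wanted` receives in a unit which happens to name `earlier` first.
std::string name_after(const std::string& earlier, const std::string& wanted) {
    CounterNamer unit;
    unit.name_for(earlier);
    return unit.name_for(wanted);
}
```

A unit that sees the int/double struct first calls it `t0`, while a unit that names some other type first calls the same struct `t1`, so the two object files disagree.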

Here's what won't work:

  1. Using the ABI mangled_name function, because it already depends on structs having unique names. Might it work in a C++11-compliant ABI by pretending the struct is anonymous?

  2. Templates, e.g. struct2, because templates don't work with recursive types.

  3. A complete mangling, because it gives names which are way too long (hundreds of characters!).

Apart from a global registry (YUK!) the only thing I can think of is to first create a unique long mangled name, and then use a digest or hash function to shorten it (and hope there are no clashes).

Actual problem: generating libraries which can be called where the types are anonymous, e.g. tuples, sum types, function types.

Any other ideas?

EDIT: Additional description of the recursive type problem. Consider defining a linked list like this:

template<class T>
typedef pair<list<T>*, T> list;

This is actually what is required. It doesn't work for two reasons. First, you can't template a typedef. [NO, you can NOT use a template class with a typedef in it, it doesn't work.] Second, you can't pass in list* as an argument because it isn't defined yet. In C, without polymorphism, you can do it:

struct list_int { struct list_int *next; int value; };

There are several workarounds. For this particular problem you can use a variant of the Barton-Nackman trick, but it doesn't generalise.

There is a general workaround, first shown me by Gabrielle des Rois, using a template with open recursion, and then a partial specialisation to close it. But this is extremely difficult to generate and would probably be unreadable even if I could figure out how to do it.

There's another problem doing variants properly too, but that's not directly related (it's just worse because of the stupid restriction against declaring unions with constructable types).

Therefore, my compiler simply uses ordinary C types. It has to handle polymorphism anyhow: one of the reasons for writing it was to bypass the problems of the C++ type system, including templates. This then leads to the naming problem.

Comments (2)

长梦不多时 2024-10-16 13:39:44

Do you actually need the names to agree? Just define the structs separately, with different names, in the different translation units and reinterpret_cast<> where necessary to keep the C++ compiler happy. Of course that would be horrific in hand-written code, but this is code generated by your compiler, so you can (and I assume do) perform the necessary static type checks before the C++ code is generated.

If I've missed something and you really do need the type names to agree, then I think you already answered your own question: Unless the compiler can share information between the translation of multiple translation units (through some global registry), I can't see any way of generating unique, deterministic names from the type's structural form except the obvious one of name-mangling.

As for the length of names, I'm not sure why it matters? If you're considering using a hash function to shorten the names then clearly you don't need them to be human-readable, so why do they need to be short?

Personally I'd probably generate semi-human-readable names, in a similar style to existing name-mangling schemes, and not bother with the hash function. So, instead of generating struct_int_double you might generate sid (struct, int, double) or si32f64 (struct, 32-bit integer, 64-bit float) or whatever. Names like that have the advantage that they can still be parsed directly (which seems like it would be pretty much essential for debugging).
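As a rough sketch of what such a scheme could look like, assuming a tiny structural type representation: `Type`, its `Kind` values and `mangle` are all hypothetical, not anything from the actual compiler. One thing I've added beyond the `si32f64` example is a member count after `s`, because without it nested structs become ambiguous (struct{struct{int}, double} and struct{struct{int, double}} would both come out as `ssi32f64`). Recursive types are not handled here; they would need some back-reference notation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical structural type description (requires C++17 for the
// self-referential std::vector member).
struct Type {
    enum class Kind { Int32, Float64, Pointer, Struct } kind;
    std::vector<Type> members;  // pointee (one element) or struct fields
};

// Builds a deterministic, still-parseable name from the type's structure.
std::string mangle(const Type& t) {
    switch (t.kind) {
        case Type::Kind::Int32:
            return "i32";
        case Type::Kind::Float64:
            return "f64";
        case Type::Kind::Pointer:
            assert(t.members.size() == 1);
            return "p" + mangle(t.members[0]);
        case Type::Kind::Struct: {
            // 's' + member count + each member's code, in declaration order.
            std::string out = "s" + std::to_string(t.members.size());
            for (const Type& m : t.members) out += mangle(m);
            return out;
        }
    }
    return {};  // unreachable for valid Kind values
}
```

Because the name depends only on the structure, every translation unit that mangles struct{int, double} gets `s2i32f64`, with no shared state between compilations.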

Edit

Some more thoughts:

  • Templates: I don't see any real advantage in generating template code to get around this problem, even if it were possible. If you're worried about hitting symbol name length limits in the linker, templates can't help you, because the linker has no concept of templates: any symbols it sees will be mangled forms of the template structure generated by the C++ compiler, and will have exactly the same problem as long mangled names generated directly by the Felix compiler.
  • Any types that have been named in Felix code should be retained and used directly (or nearly directly) in the generated C++ code. I would think there are practical (soft) readability/maintainability constraints on the complexity of anonymous types used in Felix code, which are the only ones you need to generate names for. I assume your "variants" are discriminated unions, so each component part must have a name (the tag) defined in the Felix code, and again these names can be retained. (I mentioned this in a comment, but since I'm editing my answer I might as well include it.)
  • Reducing mangled-name length: Running a long mangled name through a hash function sounds like the easiest way to do it. The chance of collisions should be acceptable as long as you use a good hash function and retain enough bits in your hashed name; with a 37-character alphabet for encoding the hashed name, a full 160-bit SHA-1 hash can be written in about 31 characters. The hash function idea means that you won't be able to get directly back from a hashed name to the original name, but you might never need to do that. And I guess you could dump out an auxiliary name-mapping table as part of the compilation process (or maybe re-generate the name from the C struct definition, where it's available). Alternatively, if you still really don't like hash functions, you could probably define a reasonably compact bit-level encoding (then write that in the 37-character identifier alphabet), or even run some general-purpose compression algorithm on that bit-level encoding. If you have enough Felix code to analyse you could even pre-generate a fixed compression dictionary. That's stark raving bonkers of course: just use a hash.
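A minimal sketch of the hash-and-encode step, under loud assumptions: it uses 64-bit FNV-1a purely as a short stand-in for a stronger digest such as SHA-1, so it keeps far fewer bits than the 160 discussed above. The `short_name` function and the leading `t` are made up for illustration. The 37-character alphabet is lowercase letters, digits and underscore. Note that C++ reserves identifiers containing a double underscore, so a real generator might prefer to drop `_` from the alphabet.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 64-bit FNV-1a. A stand-in only: swap in SHA-1 or similar for real use.
std::uint64_t fnv1a64(const std::string& s) {
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 0x100000001b3ULL;                // FNV prime
    }
    return h;
}

// Shortens a long mangled name to a fixed-width identifier: a leading
// letter followed by the hash written as 13 base-37 digits (37^13 > 2^64).
std::string short_name(const std::string& mangled) {
    static const std::string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789_";
    assert(alphabet.size() == 37);
    std::uint64_t h = fnv1a64(mangled);
    std::string out = "t";  // hypothetical prefix so the name never starts with a digit
    for (int i = 0; i < 13; ++i) {
        out += alphabet[h % 37];
        h /= 37;
    }
    return out;
}
```

The result is always 14 characters however long the input is, and the same mangled name gives the same short name in every translation unit. If you ever need to get back to the original, dump the auxiliary mapping table alongside.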

Edit 2: Sorry, brain failure -- SHA-1 digests are 160 bits, not 128.


PS. Not sure why this question was down-voted -- it seems reasonable to me, although some more context about this compiler you're working on might help.

岁月静好 2024-10-16 13:39:44

I don't really understand your problem.

template<typename T>
struct SListItem
{
    SListItem* m_prev;
    SListItem* m_next;
    T m_value;
};

int main()
{
    SListItem<int> sListItem;
}