Do compilers usually have special optimizations for strings?
Often you see things like
std::map<std::string, somethingelse> m_named_objects;
or
std::string state;
//...
if(state == "EXIT")
    exit();
else if(state == "california")
    hot();
where people use strings purely to make something more readable. The same thing could easily be achieved with something like integer IDs.
Can modern compilers (MSVC, g++, etc.) usually employ special optimizations for these kinds of cases? Or should this be avoided because of bad performance or for other reasons?
Comments (4)
As far as I know, compilers don't make those kinds of optimizations. It's definitely not a "standard" optimization.
At least for your second case, it seems to me that enumerations are more readable and can be faster (since integer comparisons are rather cheap relative to string comparison).
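A minimal sketch of the enumeration approach for the second snippet in the question. The names State, parse_state and describe are made up for illustration; the idea is to convert the string once at the boundary and work with integer comparisons afterwards.

```cpp
#include <optional>
#include <string_view>

// Hypothetical states standing in for the "EXIT" / "california" strings.
enum class State { Exit, California };

// Convert text to the enum once, e.g. when reading input or configuration.
inline std::optional<State> parse_state(std::string_view text) {
    if (text == "EXIT") return State::Exit;
    if (text == "california") return State::California;
    return std::nullopt;  // unknown name
}

// After parsing, dispatch is a comparison of small integers.
inline const char* describe(State s) {
    switch (s) {
        case State::Exit:       return "exit";
        case State::California: return "hot";
    }
    return "unknown";  // only reached for values outside the enumerators
}
```

The string comparisons still happen, but only once per input value rather than at every place the state is checked.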
Libraries do.
Compilers might optimize by aliasing shared/identical static strings (assuming that they really are treated as constants).
All C++ standard library implementations I'm currently aware of sport a 'small string optimization', meaning that no extra heap allocation needs to occur for small strings; i.e. a short string held in a local std::string
will be fully auto (stack) allocated, and in highly optimized cases perhaps even register allocated(?)
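One way to observe the small string optimization is to check whether the character buffer lies inside the std::string object itself. This is only a sketch: stored_inline is a made-up helper, the standard does not require the optimization, and the size threshold differs between implementations.

```cpp
#include <functional>
#include <string>

// Returns true if the string's characters are stored inside the std::string
// object itself, i.e. no separate heap block holds them.
inline bool stored_inline(const std::string& s) {
    const char* begin = reinterpret_cast<const char*>(&s);
    const char* end = begin + sizeof(s);
    std::less<const char*> before;  // well-defined order even for unrelated pointers
    return !before(s.data(), begin) && before(s.data(), end);
}
```

On the common implementations (libstdc++, libc++, MSVC), a string such as "short" should report true, while a string of, say, 64 characters should report false because its buffer is heap allocated.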
If you need blazingly fast string lookups and can afford some time spent building your data structure, look at tries (Wikipedia: Trie, Radix tree).
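A minimal byte-wise trie might look like the following. It is a sketch for illustration, not a tuned implementation: the class name Trie and its interface are assumptions, and the 256-slot nodes trade a lot of memory for simple, fast lookups (a radix tree would compress this considerably).

```cpp
#include <array>
#include <memory>
#include <string_view>

// Minimal trie over bytes: one node per character of every stored key.
class Trie {
    struct Node {
        std::array<std::unique_ptr<Node>, 256> child{};
        bool terminal = false;  // true if a stored key ends at this node
    };
    Node root_;

public:
    void insert(std::string_view key) {
        Node* node = &root_;
        for (unsigned char c : key) {
            auto& next = node->child[c];
            if (!next) next = std::make_unique<Node>();
            node = next.get();
        }
        node->terminal = true;
    }

    // Lookup cost depends on the key length, not on how many keys are stored.
    bool contains(std::string_view key) const {
        const Node* node = &root_;
        for (unsigned char c : key) {
            node = node->child[c].get();
            if (!node) return false;
        }
        return node->terminal;
    }
};
```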
As far as drop-in replacements go, a lot can usually be gained by using a properly tuned hash map instead of an RB-tree-based one:
replace std::map by std::unordered_map.
Be happy
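A minimal sketch of the hash-map replacement suggested above, assuming std::unordered_map is the hash map in question. Object is a hypothetical stand-in for the question's unspecified somethingelse, and the reserve and load-factor values are arbitrary examples of "tuning", not recommendations.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical payload type; the question leaves `somethingelse` unspecified.
struct Object { int id; };

using NamedObjects = std::unordered_map<std::string, Object>;

inline NamedObjects make_named_objects() {
    NamedObjects m;
    m.max_load_factor(0.7f);  // fewer collisions at the cost of more buckets
    m.reserve(64);            // size the bucket array up front to avoid rehashing
    m.emplace("EXIT", Object{1});
    m.emplace("california", Object{2});
    return m;
}

// Average O(1) lookup: one hash of the key plus, usually, one string comparison.
inline const Object* find_object(const NamedObjects& m, const std::string& name) {
    auto it = m.find(name);
    return it == m.end() ? nullptr : &it->second;
}
```

Unlike std::map, the unordered container does not iterate in key order, so this is only a drop-in replacement when the code does not rely on sorted traversal.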
In the examples given, the compiler generally cannot optimize because the content is runtime dependent.
std::map<std::string, int> does not have the most desirable performance characteristics, as operator<() on a std::string is relatively expensive.
Optimizations for strings are for libraries, not compilers. If you want string-like identifiers, enums are one possibility. But a better one, particularly for printing and debugging, is a fixed-length identifier string class.
It would be convertible to const char * and std::string, but it would have zero memory allocations. Instead, it would just be a wrapper around a 32-character (or whatever you want) array.
The best part is that, since it's an identifier, you don't care about ASCII character-by-character comparisons. operator< can just read the 32-character array as 8 uint32_ts, or even as 4 uint64_ts. All you need is an ordering, not a specific ordering. operator== can do similar tests.
It's a pretty simple class to write. If you want case-insensitive comparisons, you could just convert the string to lowercase when you copy it into the object.
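A rough sketch of such a class, under stated assumptions: the name FixedId and its members are invented here, the constructor simply cuts over-long input at the end, and the resulting order is consistent but not alphabetical (it depends on the machine's byte order).

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>

class FixedId {
public:
    static constexpr std::size_t kSize = 32;  // 31 characters plus the '\0' terminator

    explicit FixedId(std::string_view text) {
        // data_ starts zero-filled, so unused bytes are always '\0' and two
        // equal identifiers have byte-identical storage.
        const std::size_t n = std::min(text.size(), kSize - 1);
        std::copy_n(text.data(), n, data_.begin());
    }

    const char* c_str() const { return data_.data(); }
    std::string str() const { return std::string(c_str()); }

    // Both comparisons look at four 64-bit words instead of 32 single characters.
    friend bool operator==(const FixedId& a, const FixedId& b) {
        return a.words() == b.words();
    }
    friend bool operator<(const FixedId& a, const FixedId& b) {
        return a.words() < b.words();  // some strict ordering, not alphabetical
    }

private:
    std::array<std::uint64_t, 4> words() const {
        std::array<std::uint64_t, 4> w;
        std::memcpy(w.data(), data_.data(), kSize);  // avoids aliasing problems
        return w;
    }

    std::array<char, kSize> data_{};
};
```

Because operator< defines a strict ordering, FixedId can be used directly as a std::map key, with no heap allocation per key.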
If you need strings longer than 31 bytes (one for the \0 terminator), then I would suggest truncating the string down to size. But truncate from the middle of the given string, not the end. The beginnings and ends of identifiers tend to be more unique than the middle. You could even put some special characters in a truncated string to identify that it is a truncated version.
It is also possible to take this idea and put a hash in the string. So the first 4 bytes would be a hash of the original string, not of the truncation. Comparison tests would just use the hash, and the other 28 bytes are there to make it human-readable.
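The middle-truncation suggestion could be sketched as follows. The function name truncate_middle, the default limit of 31 characters, and the "~" marker are all assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Shorten `text` to at most `max_len` characters by cutting out the middle and
// leaving `marker` where the cut was made. Assumes max_len > marker.size().
inline std::string truncate_middle(std::string_view text,
                                   std::size_t max_len = 31,
                                   std::string_view marker = "~") {
    if (text.size() <= max_len) return std::string(text);

    const std::size_t keep = max_len - marker.size();
    const std::size_t head = (keep + 1) / 2;  // the front gets the odd character
    const std::size_t tail = keep - head;

    std::string out;
    out.reserve(max_len);
    out.append(text.substr(0, head));
    out.append(marker);
    out.append(text.substr(text.size() - tail));
    return out;
}
```

With the defaults, a 40-character name keeps its first 15 and last 15 characters with the marker between them, so the result fits a 32-byte buffer including the terminator.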