C/C++: Optimization of pointers to string constants
Have a look at this code:
#include <iostream>
using namespace std;

int main()
{
    const char* str0 = "Watchmen";
    const char* str1 = "Watchmen";
    char* str2 = "Watchmen";
    char* str3 = "Watchmen";

    cerr << static_cast<void*>( const_cast<char*>( str0 ) ) << endl;
    cerr << static_cast<void*>( const_cast<char*>( str1 ) ) << endl;
    cerr << static_cast<void*>( str2 ) << endl;
    cerr << static_cast<void*>( str3 ) << endl;

    return 0;
}
Which produces an output like this:
0x443000
0x443000
0x443000
0x443000
This was on the g++ compiler running under Cygwin. The pointers all point to the same location even with no optimization turned on (-O0).
Does the compiler always optimize so much that it searches all the string constants to see if they are equal? Can this behaviour be relied on?
Comments (6)
It can't be relied on; it is an optimization that is not part of any standard.
I changed the corresponding lines of your code to:
The output for the -O0 optimization level is:
But for -O1 it's:
As you can see, GCC (v4.1.2) reused the first string for all the subsequent substrings. How string constants are arranged in memory is the compiler's choice.
It's an extremely easy optimization, probably so much so that most compiler writers don't even consider it much of an optimization at all. Setting the optimization flag to the lowest level doesn't mean "Be completely naive," after all.
Compilers will vary in how aggressive they are at merging duplicate string literals. They might limit themselves to a single subroutine — put those four declarations in different functions instead of a single function, and you might see different results. Others might do an entire compilation unit. Others might rely on the linker to do further merging among multiple compilation units.
You can't rely on this behavior, unless your particular compiler's documentation says you can. The language itself makes no demands in this regard. I'd be wary about relying on it in my own code, even if portability weren't a concern, because behavior is liable to change even between different versions of a single vendor's compiler.
You surely should not rely on that behavior, but most compilers will do this. Any literal value ("Hello", 42, etc.) will be stored once, and any pointers to it will naturally resolve to that single reference.
If you find that you need to rely on that, then be safe and recode as follows:
You shouldn't count on that of course. An optimizer might do something tricky on you, and it should be allowed to do so.
It is, however, very common. I remember back in '87 a classmate was using the DEC C compiler and had this weird bug where all his literal 3's got turned into 11's (numbers may have changed to protect the innocent). He even did a printf ("%d\n", 3) and it printed 11.
He called me over because it was so weird (why does that make people think of me?), and after about 30 minutes of head scratching we found the cause. It was a line roughly like this:
Note the single "=" character. Yes, that was a typo. The compiler had a wee bug and allowed this. The effect was to turn all his literal 3's in the entire program into whatever happened to be in x at the time.
Anyway, it's clear the C compiler was putting all literal 3's in the same place. If a C compiler back in the '80s was capable of doing this, it can't be too tough to do. I'd expect it to be very common.
I would not rely on the behavior, because I am doubtful the C or C++ standards would make explicit this behavior, but it makes sense that the compiler does it. It also makes sense that it exhibits this behavior even in the absence of any optimization specified to the compiler; there is no trade-off in it.
All string literals in C or C++ (e.g. "string literal") are read-only, and thus constant. When you say:
You are in a sense downcasting the string to a non-const type. Nevertheless, you can't do away with the read-only attribute of the string: if you try to manipulate it, you'll be caught at run-time rather than at compile-time. (Which is actually a good reason to use const char * when assigning string literals to a variable of yours.)
No, it can't be relied on, but storing read-only string constants in a pool is a pretty easy and effective optimization. It's just a matter of storing an alphabetical list of strings, and then outputting them into the object file at the end. Think of how many "\n" or "" constants are in an average code base.
If a compiler wanted to get extra fancy, it could re-use suffixes too: "\n" can be represented by pointing to the last character of "Hello\n". But that likely comes with very little benefit for a significant increase in complexity.
Anyway, I don't believe the standard says anything about where anything is stored really. This is going to be a very implementation-specific thing. If you put two of those declarations in a separate .cpp file, then things will likely change too (unless your compiler does significant linking work.)