Can the compiler optimize away the initialization of a static local variable?
What should the behavior be in the following case?

    class C {
        boost::mutex mutex_;
        std::map<...> data_;
    };

    C& get() {
        static C c;
        return c;
    }

    int main() {
        get(); // is the compiler free to optimize out this call?
        ....
    }

Is the compiler allowed to optimize out the call to get()?

The idea was to touch the static variable so that it is initialized before any multithreaded operations need it.

Is this a better option?

    C& get() {
        static C *c = new C();
        return *c;
    }
Updated (2023) answer:

In C++23 (N4950), any side effects of initializing a static local variable are observable the first time control passes through its declaration. As such, unless the compiler can determine that initializing the variable has no visible side effects, it will have to generate code to call get() at the appropriate time (or to execute an inlined version of get(), as the case may be).

Contrary to earlier standards, C++23 no longer gives permission for dynamic initialization of a static local variable to be done "early" (as discussed below). The relevant wording is in [stmt.dcl]/3.

Original (2010) answer:

The C and C++ standards operate under a rather simple principle generally known as the "as-if rule": basically, the compiler is free to do almost anything as long as no conforming code can discern the difference between what it did and what was officially required.

I don't see a way for conforming code to discern whether get was actually called in this case, so it looks to me like the compiler is free to optimize it out.

At least as recently as N4296, the standard contained explicit permission to do early initialization of static local variables.

So, under this rule, initialization of the local variable could happen arbitrarily early in execution, so even if it has visible side effects, they're allowed to happen before any code that attempts to observe them. As such, you aren't guaranteed to see them, so optimizing it out is allowed.
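To make the as-if argument concrete, here is a minimal sketch in which the static local's constructor has a side effect the program can observe. The Noisy type and the event_log() and trace_calls() helpers are hypothetical and not part of the question. With current GCC and Clang the object is built on the first pass through its declaration and never again, which is also the ordering the updated answer describes for C++23.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: a process-wide log of events, in order.
std::vector<std::string>& event_log() {
    static std::vector<std::string> log;
    return log;
}

// Hypothetical type whose constructor has an observable side effect.
struct Noisy {
    Noisy() { event_log().push_back("Noisy constructed"); }
};

Noisy& get() {
    static Noisy n;  // dynamic initialization: runs on the first pass only
    return n;
}

// Records the order of events around two calls to get().
std::vector<std::string> trace_calls() {
    event_log().push_back("before first get()");
    get();
    event_log().push_back("after first get()");
    get();
    event_log().push_back("after second get()");
    return event_log();
}
```

Calling trace_calls() once from main and printing the result shows where construction lands relative to the surrounding events. Because that position is observable, the compiler can no longer treat the call as dead code.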
Based on your edits, here's an improved version, with the same results.
Input:
Output:
Noteworthy: no code is emitted for ::get, but main still allocates ::get::c (at %4) with a guard variable as needed (at %2 and at the end of invcont.i and lpad.i). LLVM is inlining all of that here.
tl;dr: Don't worry about it, the optimizer normally gets this stuff right. Are you seeing an error?
Your original code is safe. Don't introduce an extra level of indirection (a pointer variable that has to be loaded before the address of the std::map is available).

As Jerry Coffin says, your code has to run as if it ran in source order. That includes running as if it had constructed your boost::mutex (or std::mutex) and std::map before later stuff in main, such as starting threads.

Before C++11, the language standard and memory model weren't officially thread-aware, but stuff like this (thread-safe static-local initialization) worked anyway because compiler writers wanted their compilers to be useful. For example, GCC 4.1 from 2006 (https://godbolt.org/z/P3sjo4Tjd) still uses a guard variable to make sure a single thread does the constructing in case multiple calls to get() happen at the same time.

Now, with C++11 and later, the ISO standard does include threads, and this is officially required to be safe.
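As a rough illustration of that guarantee, the sketch below races several threads on the first call to get() and counts how often the constructor runs. The Counted type, the constructions counter and the race_on_get() helper are made up for this example. It assumes a C++11 or later toolchain with thread support enabled (for instance -pthread on older GNU/Linux setups).

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Counts how many times the constructor of Counted actually runs.
std::atomic<int> constructions{0};

struct Counted {
    Counted() { constructions.fetch_add(1, std::memory_order_relaxed); }
};

Counted& get() {
    static Counted c;  // C++11 and later: initialized exactly once, even under races
    return c;
}

// Starts `n` threads that all call get(), then returns the constructor count.
int race_on_get(int n) {
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([] { get(); });
    }
    for (std::thread& t : workers) {
        t.join();
    }
    return constructions.load();
}
```

However many threads take part, the count stays at one, so touching the variable from main before starting threads is not needed for correctness.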
Since your program can't observe the difference, it's hypothetically possible that a compiler could choose to skip construction now and let it happen in the first thread to actually call get() in a way that isn't optimized away. That's fine: construction of static locals is thread-safe, with compilers like GCC and Clang using a "guard variable" that they check (read-only, with an acquire load) at the start of the function.

A file-scope static variable would avoid the load + test/branch fast-path overhead of the guard variable that happens on every call, and would be safe as long as nothing calls get() before the start of main(). A guard variable is pretty cheap, especially on ISAs like x86, AArch64, and 32-bit ARMv8 that have cheap acquire loads, but it is more expensive on ARMv7, for example, where an acquire load uses a dmb ish full barrier.

If some hypothetical compiler actually did the optimization you're worried about, the difference could be in NUMA placement of the page of .bss holding static C c, if nothing else in that page was touched first. It could also stall other threads very briefly in their first calls to get() if construction isn't finished by the time a second thread also calls get().

Current GCC and Clang don't do this optimization in practice
Clang 17 with libc++ makes the following asm for x86-64 with -O3 (demangled by Godbolt). The asm for get() is also inlined into main. GCC with libstdc++ is pretty similar, really only differing in the std::map internals.

Even though the std::map is unused, constructing it involves calling __cxa_atexit (a C++-internals version of atexit) to register the destructor that frees the red-black tree as the program exits. I suspect this is the part that's opaque to the optimizer, and the main reason it doesn't get optimized like static int x = 123; or static void *foo = &bar; into pre-initialized space in .data with no run-time construction (and no guard variable).

Constant propagation that avoids the need for any run-time initialization is what happens if struct C only includes a std::mutex, which on GNU/Linux at least doesn't have a destructor and is actually zero-initialized. (C++ before C++23 allowed early init even when that included visible side effects. This case doesn't involve any; compilers can still constant-propagate static int local_foo = an_inline_function(123); into some bytes in .data with no run-time call.)

GCC and Clang also don't optimize away the guard variable (if there's any run-time work to do), even though main doesn't start any threads at all, let alone before calling get(). A constructor in some other compilation unit (including a shared library) could have started another thread that called get() at the same time main did. (It's arguably a missed optimization with gcc -fwhole-program.)

If the constructors had any (potentially) visible side effects, perhaps including a call to new since new is replaceable, compilers couldn't defer it, because the C++ language rules say when the constructor is called in the abstract machine. (Compilers are allowed to make some assumptions about new, though; e.g. Clang with libc++ can optimize away new/delete for an unused std::vector.)

Classes like std::unordered_map (a hash table instead of a red-black tree) can use new in their constructor, depending on the standard library implementation.

I was testing with std::map<int,int>, so the individual objects don't have destructors with visible side effects. A std::map<Foo,Bar> where Foo::~Foo prints something would make it matter when the static-local initializer runs, since that's when we call __cxa_atexit. Assuming destruction happens in reverse order of construction, waiting until later to call __cxa_atexit could lead to the map being destructed sooner, leading to Foo::~Foo() calls happening too soon, potentially before instead of after some other visible side effect.

Or some other global data structure could have references to the int objects inside a std::map<int,int>, and use those in its destructor. That wouldn't be safe if we destruct the std::map too soon.

(I'm not sure if ISO C++, or GNU C++, gives such ordering guarantees for sequencing of destructors. But if it does, that would be a reason compilers couldn't normally defer construction when it involves registering a destructor. And looking for that optimization in trivial programs isn't worth the cost in compile time.)
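The guard-variable mechanism described above can be approximated by hand. The sketch below is only an illustration of the idea, not what GCC or Clang literally emit: real compilers use a small guard flag together with runtime helpers, whereas here an atomic pointer doubles as the guard. The names storage, instance and init_lock are invented for the example, and std::map<int,int> stands in for the map type the question leaves unspecified.

```cpp
#include <atomic>
#include <map>
#include <mutex>
#include <new>

struct C {
    std::mutex mutex_;
    std::map<int, int> data_;  // assumption: the question elides the template arguments
};

// Hand-written stand-ins for what `static C c;` inside get() needs.
alignas(C) static unsigned char storage[sizeof(C)];  // raw, zero-initialized space
static std::atomic<C*> instance{nullptr};            // plays the role of the guard
static std::mutex init_lock;                         // serializes the slow path

C& get() {
    // Fast path: one acquire load plus a branch on every call.
    C* p = instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        // Slow path: only one thread may construct the object.
        std::lock_guard<std::mutex> hold(init_lock);
        p = instance.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = ::new (static_cast<void*>(storage)) C();  // construct in place
            // A real compiler would also register the destructor at this point;
            // this sketch never destroys the object.
            instance.store(p, std::memory_order_release);
        }
    }
    return *p;
}
```

The acquire load on the fast path is the per-call cost discussed above. It is cheap on x86-64 and AArch64, and more noticeable on ISAs where an acquire load needs a full barrier.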
With file-scope static to avoid a guard variable

Notice the lack of a guard variable, making the fast path faster, especially for ISAs like ARMv7 that don't have a good way to do just an acquire barrier: https://godbolt.org/z/4bGx3Tasj

The constructor code that does the stores and calls __cxa_atexit still exists; it's just in a separate function called _GLOBAL__sub_I_example.cpp (Clang) or _GLOBAL__sub_I_get() (GCC), which the compiler adds to a list of init functions to be called before main.

Function-scoped local vars are normally fine; the overhead is pretty minimal, especially on x86-64 and ARMv8. But since you were worried about micro-optimizations like when (or whether) the std::map is constructed at all, I thought it was worth mentioning, and it shows the mechanism compilers use to make this stuff work under the hood.
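For reference, a file-scope version along the lines discussed in this answer might look like the sketch below. It assumes std::mutex in place of boost::mutex and std::map<int,int> for the elided map type, and it is only safe under the condition stated earlier: nothing may call get() before main() starts.

```cpp
#include <map>
#include <mutex>

struct C {
    std::mutex mutex_;
    std::map<int, int> data_;  // assumption: template arguments are not given in the question
};

namespace {
// File-scope object: on typical GCC/Clang targets it is constructed by an
// init function that runs before main(), so get() needs no guard variable.
C c;
}  // namespace

C& get() {
    return c;  // just hands back the object; no guard load or branch
}
```

The trade-off is that the object is always constructed, even if get() is never called, and that initialization order relative to other translation units is no longer tied to first use.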
Whether the compiler optimizes the function call or not is basically unspecified behavior as per the Standard. Unspecified behavior is behavior chosen from a finite set of possibilities, where the choice may not be consistent every time. In this case, the choice is "optimize" or "don't optimize"; the Standard does not specify which, and the implementation is not required to document it either, since a given implementation may not make the choice consistently.

If the idea is just to "touch" the variable, would it help to add a dummy volatile variable and increment it in each call?
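One possible reading of that suggestion, as a hedged sketch: the touch_count variable is hypothetical, and std::mutex and std::map<int,int> are assumptions standing in for the question's types. Accesses to a volatile object count as observable behavior, so the compiler has to perform the read and write on every call and cannot drop the call entirely. Whether this is needed at all is a separate matter, given that static-local initialization is already thread-safe since C++11.

```cpp
#include <map>
#include <mutex>

struct C {
    std::mutex mutex_;
    std::map<int, int> data_;  // assumption: the question elides the template arguments
};

// Dummy counter: each volatile read and write must actually be performed.
volatile int touch_count = 0;

C& get() {
    static C c;
    touch_count = touch_count + 1;  // written this way because ++ on volatile is deprecated in C++20
    return c;
}
```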