Can the compiler optimize out the initialization of a static local variable?

Posted on 2024-09-24 06:30:00

What should the behavior be in the following case:

class C {
    boost::mutex mutex_;
    std::map<...> data_;
};

C& get() {
    static C c;
    return c;
}

int main() {
    get(); // is compiler free to optimize out the call? 
    ....
}

Is the compiler allowed to optimize out the call to get()?

The idea was to touch the static variable to initialize it before multithreaded operations need it.

Is this a better option?

C& get() {
    static C *c = new C();
    return *c;
}
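In other words, the intent is roughly the following (an illustrative sketch, not part of the original question; std::mutex stands in for boost::mutex, and the map type, member access, and thread count are made up):

#include <map>
#include <mutex>
#include <thread>
#include <vector>

class C {
public:
    std::mutex mutex_;              // stand-in for boost::mutex
    std::map<int, int> data_;
};

C& get() {
    static C c;                     // constructed the first time control reaches this line
    return c;
}

int main() {
    get();                          // "touch" the static so it is constructed before any threads exist

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([] {
            std::lock_guard<std::mutex> lock(get().mutex_);
            ++get().data_[0];       // hypothetical use of the shared map
        });
    for (auto& t : workers)
        t.join();
}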

Comments (4)

怪我鬧 2024-10-01 06:30:00

Updated (2023) Answer:

In C++23 (N4950), any side effects of initializing a static local variable are observable as its containing block is entered. As such, unless the compiler can determine that initializing the variable has no visible side effects, it will have to generate code to call get() at the appropriate time (or to execute an inlined version of get(), as the case may be).

Contrary to earlier standards, C++23 no longer gives permission for dynamic initialization of a static local variable to be done "early" (as discussed below).

[stmt.dcl]/3:

Dynamic initialization of a block variable with static storage duration (6.7.5.2) or thread storage duration
(6.7.5.3) is performed the first time control passes through its declaration; such a variable is considered
initialized upon the completion of its initialization.
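A small sketch of the ordering that wording pins down (my example, not the answerer's), using a deliberately noisy initializer:

#include <cstdio>

static int noisy_init() {
    std::puts("initializing c");    // visible side effect of the initializer
    return 42;
}

void f() {
    std::puts("entered f");
    static int c = noisy_init();    // dynamic initialization runs here, the first time
    (void)c;                        // control passes through this declaration
}

int main() {
    f();    // prints "entered f" then "initializing c"
    f();    // prints only "entered f": c is already initialized
}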

Original (2010) answer:

The C and C++ standards operate under a rather simple principle generally known as the "as-if rule" -- basically, that the compiler is free to do almost anything as long as no conforming code can discern the difference between what it did and what was officially required.

I don't see a way for conforming code to discern whether get was actually called in this case, so it looks to me like it's free to optimize it out.

At least as recently as N4296, the standard contained explicit permission to do early initialization of static local variables:

Constant initialization (3.6.2) of a
block-scope entity with static storage duration, if applicable, is performed before its block is first entered.
An implementation is permitted to perform early initialization of other block-scope variables with static or
thread storage duration under the same conditions that an implementation is permitted to statically initialize
a variable with static or thread storage duration in namespace scope (3.6.2). Otherwise such a variable is
initialized the first time control passes through its declaration; such a variable is considered initialized upon
the completion of its initialization.

So, under this rule, initialization of the local variable could happen arbitrarily early in execution, so even if it has visible side effects, they're allowed to happen before any code that attempts to observe them. As such, you aren't guaranteed to see them, so optimizing it out is allowed.
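For contrast, the first sentence of that quote covers cases like the following (my example): when the initializer is a constant expression, the variable is constant-initialized before the block is ever entered, so no run-time code and no guard variable is needed at all.

int& counter() {
    static int n = 123;   // constant initialization: typically just bytes in .data,
    return n;             // with nothing executed at the point of declaration
}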

梦幻的味道 2024-10-01 06:30:00

Based on your edits, here's an improved version, with the same results.

Input:

struct C { 
    int myfrob;
    int frob();
    C(int f);
 };
C::C(int f) : myfrob(f) {}
int C::frob() { return myfrob; }

C& get() {
    static C *c = new C(5);
    return *c;
}

int main() {
    return get().frob(); // is compiler free to optimize out the call? 

}

Output:

; ModuleID = '/tmp/webcompile/_28088_0.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-linux-gnu"

%struct.C = type { i32 }

@guard variable for get()::c = internal global i64 0            ; <i64*> [#uses=4]

declare i32 @__cxa_guard_acquire(i64*) nounwind

declare i8* @operator new(unsigned long)(i64)

declare void @__cxa_guard_release(i64*) nounwind

declare i8* @llvm.eh.exception() nounwind readonly

declare i32 @llvm.eh.selector(i8*, i8*, ...) nounwind

declare void @__cxa_guard_abort(i64*) nounwind

declare i32 @__gxx_personality_v0(...)

declare void @_Unwind_Resume_or_Rethrow(i8*)

define i32 @main() {
entry:
  %0 = load i8* bitcast (i64* @guard variable for get()::c to i8*), align 8 ; <i8> [#uses=1]
  %1 = icmp eq i8 %0, 0                           ; <i1> [#uses=1]
  br i1 %1, label %bb.i, label %_Z3getv.exit

bb.i:                                             ; preds = %entry
  %2 = tail call i32 @__cxa_guard_acquire(i64* @guard variable for get()::c) nounwind ; <i32> [#uses=1]
  %3 = icmp eq i32 %2, 0                          ; <i1> [#uses=1]
  br i1 %3, label %_Z3getv.exit, label %bb1.i

bb1.i:                                            ; preds = %bb.i
  %4 = invoke i8* @operator new(unsigned long)(i64 4)
          to label %invcont.i unwind label %lpad.i ; <i8*> [#uses=2]

invcont.i:                                        ; preds = %bb1.i
  %5 = bitcast i8* %4 to %struct.C*               ; <%struct.C*> [#uses=1]
  %6 = bitcast i8* %4 to i32*                     ; <i32*> [#uses=1]
  store i32 5, i32* %6, align 4
  tail call void @__cxa_guard_release(i64* @guard variable for get()::c) nounwind
  br label %_Z3getv.exit

lpad.i:                                           ; preds = %bb1.i
  %eh_ptr.i = tail call i8* @llvm.eh.exception()  ; <i8*> [#uses=2]
  %eh_select12.i = tail call i32 (i8*, i8*, ...)* @llvm.eh.selector(i8* %eh_ptr.i, i8* bitcast (i32 (...)* @__gxx_personality_v0 to i8*), i8* null) ; <i32> [#uses=0]
  tail call void @__cxa_guard_abort(i64* @guard variable for get()::c) nounwind
  tail call void @_Unwind_Resume_or_Rethrow(i8* %eh_ptr.i)
  unreachable

_Z3getv.exit:                                     ; preds = %invcont.i, %bb.i, %entry
  %_ZZ3getvE1c.0 = phi %struct.C* [ null, %bb.i ], [ %5, %invcont.i ], [ null, %entry ] ; <%struct.C*> [#uses=1]
  %7 = getelementptr inbounds %struct.C* %_ZZ3getvE1c.0, i64 0, i32 0 ; <i32*> [#uses=1]
  %8 = load i32* %7, align 4                      ; <i32> [#uses=1]
  ret i32 %8
}

Noteworthy: no code is emitted for ::get, but main still allocates ::get::c (at %4) with a guard variable as needed (at %2 and at the end of invcont.i and lpad.i). LLVM here is inlining all of that stuff.

tl;dr: Don't worry about it, the optimizer normally gets this stuff right. Are you seeing an error?
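If you do suspect a miscompile, one way to double-check (a sketch, not part of the original answer) is to give the constructor a side effect the optimizer is not allowed to remove, such as a store to a volatile object, and confirm that it still shows up in the generated code:

volatile int ctor_ran = 0;          // volatile: stores to it are observable behavior

struct C {
    int myfrob;
    C(int f) : myfrob(f) { ctor_ran = 1; }   // this store may not be optimized out
    int frob() { return myfrob; }
};

C& get() {
    static C *c = new C(5);
    return *c;
}

int main() {
    return get().frob();            // the emitted code must still perform the volatile store
}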

淑女气质 2024-10-01 06:30:00

Your original code is safe. Don't introduce an extra level of indirection (a pointer variable that has to get loaded before the address of the std::map is available).

As Jerry Coffin says, your code has to run as if it ran in source order. That includes running as-if it has constructed your boost or std::mutex and std::map before later stuff in main, such as starting threads.

Pre C++11, the language standard and memory model weren't officially thread-aware, but stuff like this (thread-safe static-local initialization) worked anyway because compiler writers wanted their compilers to be useful. e.g. GCC 4.1 from 2006 (https://godbolt.org/z/P3sjo4Tjd) still uses a guard variable to make sure a single thread does the constructing in case multiple calls to get() happen at the same time.

Now, with C++11 and later, the ISO standard does include threads and it's officially required for that to be safe.


Since your program can't observe the difference, it's hypothetically possible that a compiler could choose to skip construction now and let it happen in the first thread that actually calls get() in a way that isn't optimized away. That's fine: construction of static locals is thread-safe, with compilers like GCC and Clang using a "guard variable" that they check (read-only with an acquire load) at the start of the function.
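What that guard check boils down to is roughly the following hand-written equivalent (a simplified sketch using double-checked locking on a heap allocation; the real mechanism constructs the object in place in .bss, uses the Itanium ABI's __cxa_guard_acquire/__cxa_guard_release helpers shown in the asm below, and registers the destructor via __cxa_atexit, all omitted here):

#include <atomic>
#include <map>
#include <mutex>

struct C {                               // stand-in for the C from the question
    std::mutex mutex_;
    std::map<int, int> data_;
};

static std::atomic<C*> c_ptr{nullptr};   // plays the role of the guard variable
static std::mutex c_init_mutex;          // plays the role of __cxa_guard_acquire/release

C& get_sketch() {
    C* p = c_ptr.load(std::memory_order_acquire);         // fast path: a single acquire load
    if (!p) {
        std::lock_guard<std::mutex> lock(c_init_mutex);    // slow path: serialize construction
        p = c_ptr.load(std::memory_order_relaxed);         // re-check under the lock
        if (!p) {
            p = new C();
            c_ptr.store(p, std::memory_order_release);     // publish the fully-constructed object
        }
    }
    return *p;
}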

A file-scope static variable would avoid the load+test/branch fast-path overhead of the guard variable that happens every call, and would be safe as long as nothing calls get() before the start of main(). A guard variable is pretty cheap especially on ISAs like x86, AArch64, and 32-bit ARMv8 that have cheap acquire loads, but more expensive on ARMv7 for example where an acquire load uses a dmb ish full barrier.

If some hypothetical compiler actually did the optimization you're worried about, the difference could be in NUMA placement of the page of .bss holding static C c, if nothing else in that page was touched first. And potentially stalling other threads very briefly in their first calls to get() if construction isn't finished by the time a second thread also calls get().


Current GCC and clang don't in practice do this optimization

Clang 17 with libc++ makes the following asm for x86-64, with -O3. (demangled by Godbolt). The asm for get() is also inlined into main. GCC with libstdc++ is pretty similar, really only differing in the std::map internals.

get():
        movzx   eax, byte ptr [rip + guard variable for get()::c]  # all x86 loads are acquire loads
        test    al, al                       # check the guard variable
        je      .LBB0_1
        lea     rax, [rip + get()::c]        # retval = address of the static variable
   # end of the fast path through the function.
   # after the first call, all callers go through this path.
        ret

 # slow path, only reached if the guard variable is zero
.LBB0_1:
        push    rax
        lea     rdi, [rip + guard variable for get()::c]
        call    __cxa_guard_acquire@PLT
        test    eax, eax   # check if we won the race to construct c,
        je      .LBB0_3    # or if we waited until another thread finished doing it.

        xorps   xmm0, xmm0
        movups  xmmword ptr [rip + get()::c+16], xmm0     # first 16 bytes of std::map<int,int> = NULL pointers
        movups  xmmword ptr [rip + get()::c], xmm0        # std::mutex = 16 bytes of zeros
        mov     qword ptr [rip + get()::c+32], 0          # another NULL
        lea     rsi, [rip + get()::c]                     # arg for __cxa_atexit
        movups  xmmword ptr [rip + get()::c+48], xmm0     # more zeros, maybe a root node?
        lea     rax, [rip + get()::c+48]                  
        mov     qword ptr [rip + get()::c+40], rax        # pointer to another part of the map object

        lea     rdi, [rip + C::~C() [base object destructor]]  # more args for atexit
        lea     rdx, [rip + __dso_handle]
        call    __cxa_atexit@PLT                 # register the destructor function-pointer with a "this" pointer

        lea     rdi, [rip + guard variable for get()::c]
        call    __cxa_guard_release@PLT          # "unlock" the guard variable, setting it to 1 for future calls
             # and letting any other threads return from __cxa_guard_acquire and see a fully-constructed object

.LBB0_3:                                     # epilogue
        add     rsp, 8
        lea     rax, [rip + get()::c]        # return value, same as in the fast path.
        ret

Even though the std::map is unused, constructing it involves calling __cxa_atexit (a C++-internals version of atexit) to register the destructor to free the red-black tree as the program exits. I suspect this is the part that's opaque to the optimizer and the main reason it doesn't get optimized like static int x = 123; or static void *foo = &bar; into pre-initialized space in .data with no run-time construction (and no guard variable).

Constant-propagation to avoid the need for any run-time initialization is what happens if struct C only includes std::mutex, which in GNU/Linux at least doesn't have a destructor and is actually zero-initialized. (C++ before C++23 allowed early init even when that included visible side-effects. This doesn't; compilers can still constant-propagate static int local_foo = an_inline_function(123); into some bytes in .data with no run-time call.)
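This is an addition of mine, not part of the original answer: if you want a guarantee of that constant-initialization treatment rather than relying on the optimizer, C++20's constinit turns it into a hard compile-time requirement, which also rules out a guard variable and any run-time constructor call (the function body here is hypothetical, matching the name used above):

constexpr int an_inline_function(int x) { return x + 1; }      // hypothetical constexpr body

int f() {
    static constinit int local_foo = an_inline_function(123);  // required to be constant-initialized:
    return local_foo;                                          // just bytes in .data, no guard, no run-time call
}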

GCC and Clang also don't optimize away the guard variable (if there's any run-time work to do), even though main doesn't start any threads at all, let alone before calling get(). A constructor in some other compilation unit (including a shared library) could have started another thread that called get() at the same time main did. (It's arguably a missed optimization with gcc -fwhole-program.)


If the constructors had any (potentially) visible side-effects, perhaps including a call to new since new is replaceable, compilers couldn't defer it because the C++ language rules say when the constructor is called in the abstract machine. (Compilers are allowed to make some assumptions about new, though, e.g. clang with libc++ can optimize away new / delete for an unused std::vector.)
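To make the "new is replaceable" point concrete (my sketch, not the answerer's): a program may provide its own global operator new, at which point the allocation done while initializing static C *c = new C(); is itself a visible side effect:

#include <cstdio>
#include <cstdlib>
#include <new>

// Replacement global allocation functions: their output is observable behavior,
// so the compiler can't in general pretend the allocation never happened
// (allocation elision for unused new/delete pairs being the narrow exception).
void* operator new(std::size_t n) {
    std::printf("operator new(%zu)\n", n);
    if (void* p = std::malloc(n))
        return p;
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept {
    std::free(p);
}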

Classes like std::unordered_map (a hash table instead of a red-black tree) do use new in their constructor.

I was testing with std::map<int,int>, so the individual objects don't have destructors with visible side-effects. A std::map<Foo,Bar> where Foo::~Foo prints something would make it matter when the static-local initializer runs, since that's when we call __cxa_atexit. Assuming destruction order happens in reverse of construction, waiting until later to call __cxa_atexit could lead to it being destructed sooner, leading to Foo::~Foo() calls happening too soon, potentially before instead of after some other visible side effect.

Or some other global data structure could maybe have references to the int objects inside a std::map<int,int>, and use those in its destructor. That wouldn't be safe if we destruct the std::map too soon.

(I'm not sure if ISO C++, or GNU C++, gives such ordering guarantees for sequencing of destructors. But if it does, that would be a reason compilers couldn't normally defer construction when it involves registering a destructor. And looking for that optimization in trivial programs isn't worth the cost in compile time.)
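A sketch of the kind of element type those paragraphs have in mind (Foo and Bar here are hypothetical, just to make the destructor side effect concrete):

#include <cstdio>
#include <map>

struct Foo {
    int id;
    bool operator<(const Foo& other) const { return id < other.id; }
    ~Foo() { std::printf("~Foo(%d)\n", id); }   // visible side effect at destruction time
};

struct Bar { int value; };

std::map<Foo, Bar>& table() {
    static std::map<Foo, Bar> m;   // when this initializer runs is when the map's destructor
    return m;                      // gets registered with __cxa_atexit, which in turn decides
}                                  // where the ~Foo calls land in the destruction order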


With file-scope static to avoid a guard variable

Notice the lack of a guard variable, making the fast path faster, especially for ISAs like ARMv7 that don't have a good way to do just an acquire barrier. https://godbolt.org/z/4bGx3Tasj -

static C global_c;     // It's not actually global, just file-scoped static

C& get2() {
    return global_c;
}
# clang -O3 for x86-64
get2():
      # note the lack of a load + branch on a guard variable
        lea     rax, [rip + global_c]
        ret

main:
      # construction already happened before main started, and we don't do anything with the address
        xor     eax, eax
        ret
# GCC -O3 -mcpu=cortex-a15     // a random ARMv7 CPU
get2():
        ldr     r0, .L81          @ PC-relative load
        bx      lr

@ somewhere nearby, between functions
.L81:
        .word   .LANCHOR0+52      @ pointer to struct C global_c

main:
        mov     r0, #0
        bx      lr

The constructor code that does the stores and calls __cxa_atexit still exists, it's just in a separate function called _GLOBAL__sub_I_example.cpp: (clang) or _GLOBAL__sub_I_get(): (GCC), which the compiler adds to a list of init functions to be called before main.

Function-scoped local vars are normally fine, the overhead is pretty minimal, especially on x86-64 and ARMv8. But since you were worried about micro-optimizations like when std::map was constructed at all, I thought it was worth mentioning. And to show the mechanism compilers use to make this stuff work under the hood.

忆伤 2024-10-01 06:30:00

Whether the compiler optimizes the function call or not is basically unspecified behavior as per the Standard. An unspecified behavior is basically a behavior which is chosen from a set of finite possibilities, but the choice may not be consistent every time. In this case, the choice is 'to optimize' or 'not', which the Standard does not specify and the implementation is also not supposed to document, as it is a choice which may not be consistently taken by a given implementation.

If the idea is just to 'touch' the static, would it help to add a dummy volatile variable and increment it on each call?

e.g.:

C& getC(){
   volatile int dummy = 0;   // initialized, so the increment doesn't read an indeterminate value
   dummy++;                  // volatile access: the compiler must keep it
   // rest of the code
}