Should the cache padding size on x86-64 be 128 bytes?

I found a comment from crossbeam.

Starting from Intel's Sandy Bridge, spatial prefetcher is now pulling pairs of 64-byte cache lines at a time, so we have to align to 128 bytes rather than 64.

Sources:

I didn't find that statement in Intel's manual. But as of its latest commit, folly still uses 128-byte padding, which makes the claim convincing to me. So I started writing code to see if I could observe this behavior. Here is my code.

#include <thread>

int counter[1024]{};

void update(int idx) {
    // each thread repeatedly increments its own element of the shared array
    for (int j = 0; j < 100000000; j++) ++counter[idx];
}

int main() {
    std::thread t1(update, 0);
    std::thread t2(update, 1);
    std::thread t3(update, 2);
    std::thread t4(update, 3);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
}

Compiler Explorer

My CPU is Ryzen 3700X. When the indexes are 0, 1, 2, 3, it takes ~1.2s to finish. When the indexes are 0, 16, 32, 48, it takes ~200ms to finish. When the indexes are 0, 32, 64, 96, it takes ~200ms to finish, which is exactly the same as before. I also tested them on an Intel machine and it gave me a similar result.

From this microbenchmark I can't see a reason to use 128-byte padding over 64-byte padding. Did I get anything wrong?


1 Comment

梦在深巷 2025-02-02 06:58:54

Intel's optimization manual does describe the L2 spatial prefetcher in SnB-family CPUs. Yes, it tries to complete 128B-aligned pairs of 64B lines when there's spare memory bandwidth (off-core request tracking slots) available while the first line is being pulled in.

Your microbenchmark doesn't show any significant time difference between 64 vs. 128 byte separation. Without any actual false sharing (within the same 64 byte line), after some initial chaos, it quickly reaches a state where each core has exclusive ownership of the cache line it's modifying. That means no further L1d misses, thus no requests to L2 which would trigger the L2 spatial prefetcher.

Contrast that with, for example, two pairs of threads contending over separate atomic<int> variables in adjacent (or non-adjacent) cache lines, or false-sharing with them. Then L2 spatial prefetching could couple the contention together, so all 4 threads contend with each other instead of as 2 independent pairs. Basically, in any case where cache lines actually are bouncing back and forth between cores, L2 spatial prefetching can make it worse if you aren't careful.
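
A minimal sketch of such a setup (the SEP macro, Counters struct, and hammer function are illustrative names I'm introducing here, not part of any benchmark in this post; with SEP=64 both atomics land in the same 128-byte-aligned pair of lines, with SEP=128 they don't):

// Build with e.g.  g++ -DSEP=64 -O2 -pthread pair-contention.cpp
#include <atomic>
#include <thread>

struct alignas(128) Counters {
    std::atomic<int> a{0};
    char pad[SEP - sizeof(std::atomic<int>)];  // distance between a and b, in bytes
    std::atomic<int> b{0};
};

Counters c;

void hammer(std::atomic<int>* p) {
    for (int i = 0; i < 10000000; i++)
        p->fetch_add(1, std::memory_order_relaxed);  // atomic RMW keeps the line bouncing
}

int main() {
    std::thread t1(hammer, &c.a), t2(hammer, &c.a);  // pair 1 contends on c.a
    std::thread t3(hammer, &c.b), t4(hammer, &c.b);  // pair 2 contends on c.b
    t1.join(); t2.join(); t3.join(); t4.join();
}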

(The L2 prefetcher doesn't keep trying indefinitely to complete pairs of lines of every valid line it's caching; that would hurt cases like this where different cores are repeatedly touching adjacent lines, more than it helps anything.)

Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size includes an answer with a longer benchmark; I haven't looked at it recently, but I think it's supposed to demo destructive interference at 64 bytes but not 128. The answer there unfortunately doesn't mention L2 spatial prefetching as one of the effects that can cause some destructive interference (although not as much as a 128-byte line size in an outer level of cache, especially not if it's an inclusive cache).
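
For reference, the padding choice the question is asking about can be sketched like this (my illustration, not code from folly or crossbeam; CachePadded and kPad are made-up names, and 128 is the crossbeam/folly choice, whereas standard libraries typically report 64 for the destructive size on x86-64):

#include <cstddef>
#include <new>   // std::hardware_destructive_interference_size (C++17, if the library provides it)

// 64 avoids classic same-line false sharing; 128 also defeats the adjacent-line
// pulls of the L2 spatial prefetcher.
constexpr std::size_t kPad = 128;  // or std::hardware_destructive_interference_size

template <typename T>
struct alignas(kPad) CachePadded {
    T value{};
};

// sizeof(CachePadded<int>) == kPad, so adjacent array elements never share
// a 128-byte-aligned pair of cache lines.
CachePadded<int> per_thread_counter[4];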


Perf counters reveal a difference even with your benchmark

There is more initial chaos that we can measure with performance counters for your benchmark. On my i7-6700k (quad core Skylake with Hyperthreading; 4c8t, running Linux 5.16), I improved the source so I could compile with optimization without defeating the memory access, and with a CPP macro so I could set the stride (in bytes) from the compiler command line. Note the ~500 memory-order mis-speculation pipeline nukes (machine_clears.memory_ordering) when we use adjacent lines. Actual number is quite variable, from 200 to 850, but there's still negligible effect on the overall time.

Adjacent lines, 500 +- 300 machine clears

$ g++ -DSIZE=64 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

 Performance counter stats for './a.out' (25 runs):

            560.22 msec task-clock                #    3.958 CPUs utilized            ( +-  0.12% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
               126      page-faults               #  224.752 /sec                     ( +-  0.35% )
     2,180,391,747      cycles                    #    3.889 GHz                      ( +-  0.12% )
     2,003,039,378      instructions              #    0.92  insn per cycle           ( +-  0.00% )
     1,604,118,661      uops_issued.any           #    2.861 G/sec                    ( +-  0.00% )
     2,003,739,959      uops_executed.thread      #    3.574 G/sec                    ( +-  0.00% )
               494      machine_clears.memory_ordering #  881.172 /sec                     ( +-  9.00% )

          0.141534 +- 0.000342 seconds time elapsed  ( +-  0.24% )

vs with 128-byte separation, only a very few machine clears

$ g++ -DSIZE=128 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

 Performance counter stats for './a.out' (25 runs):

            560.01 msec task-clock                #    3.957 CPUs utilized            ( +-  0.13% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
               124      page-faults               #  221.203 /sec                     ( +-  0.16% )
     2,180,048,243      cycles                    #    3.889 GHz                      ( +-  0.13% )
     2,003,038,553      instructions              #    0.92  insn per cycle           ( +-  0.00% )
     1,604,084,990      uops_issued.any           #    2.862 G/sec                    ( +-  0.00% )
     2,003,707,895      uops_executed.thread      #    3.574 G/sec                    ( +-  0.00% )
                22      machine_clears.memory_ordering #   39.246 /sec                     ( +-  9.68% )

          0.141506 +- 0.000342 seconds time elapsed  ( +-  0.24% )

Presumably there's some dependence on how Linux schedules threads to logical cores on this 4c8t machine. Related:
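
(One way to take that variable out of the picture is to pin each thread to a specific logical core; a minimal Linux-specific sketch, not something the runs above did:)

#include <pthread.h>   // pthread_setaffinity_np (GNU extension; g++ defines _GNU_SOURCE on Linux)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <thread>

// Pin a std::thread to one logical CPU so the kernel can't migrate it.
void pin_to_cpu(std::thread& t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}
// e.g. pin_to_cpu(t1, 0); pin_to_cpu(t2, 1); ... right after creating the threads,
// or run the whole process under  taskset -c 0-3 ./a.out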

vs. actual false-sharing within one line: 10M machine clears

The store buffer (and store forwarding) get a bunch of increments done for every false-sharing machine clear so it's not as bad as one might expect. (And would be far worse with atomic RMWs, like std::atomic<int> fetch_add, since there every single increment needs direct access to L1d cache as it executes.) Why does false sharing still affect non atomics, but much less than atomics?
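
A minimal sketch of that atomic-RMW variant (update_atomic is an illustrative name; the benchmarks in this post use a plain volatile increment instead):

#include <atomic>

alignas(128) std::atomic<int> acounter[1024]{};

void update_atomic(int idx) {
    // Every fetch_add must own the line in L1d as it executes, so with false
    // sharing the line ping-pongs on each increment instead of the store buffer
    // soaking up a batch of plain stores between machine clears.
    for (int j = 0; j < 100000000; j++)
        acounter[idx].fetch_add(1, std::memory_order_relaxed);
}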

$ g++ -DSIZE=4 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

 Performance counter stats for './a.out' (25 runs):

            809.98 msec task-clock                #    3.835 CPUs utilized            ( +-  0.42% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
               122      page-faults               #  152.953 /sec                     ( +-  0.22% )
     3,152,973,230      cycles                    #    3.953 GHz                      ( +-  0.42% )
     2,003,038,681      instructions              #    0.65  insn per cycle           ( +-  0.00% )
     2,868,628,070      uops_issued.any           #    3.596 G/sec                    ( +-  0.41% )
     2,934,059,729      uops_executed.thread      #    3.678 G/sec                    ( +-  0.30% )
        10,810,169      machine_clears.memory_ordering #   13.553 M/sec                    ( +-  0.90% )

           0.21123 +- 0.00124 seconds time elapsed  ( +-  0.59% )

Improved benchmark - align the array, and volatile to allow optimization

I used volatile so I could enable optimization. I assume you compiled with optimization disabled, so int j was also getting stored/reloaded inside the loop.

And I used alignas(128) counter[] so we'd be sure the start of the array was in two pairs of 128-byte lines, not spread across three.

#include <thread>

alignas(128) volatile int counter[1024]{};  // volatile: the loop isn't optimized away; alignas(128): element 0 starts a 128B-aligned pair of lines

void update(int idx) {
    for (int j = 0; j < 100000000; j++) ++counter[idx];
}

static const int stride = SIZE/sizeof(counter[0]);  // SIZE = separation in bytes, set from the compiler command line, e.g. -DSIZE=64
int main() {
    std::thread t1(update, 0*stride);
    std::thread t2(update, 1*stride);
    std::thread t3(update, 2*stride);
    std::thread t4(update, 3*stride);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
}