Is there a special benefit to consuming whole cache lines between iterations of a loop?

Posted 2025-02-08 14:05:00

My program adds float arrays and is unrolled 4x when compiled with max optimizations by MSVC and G++. I didn't understand why both compilers chose to unroll 4x, so I did some testing: a t-test on runtimes for manually unrolling 1-vs-2 or 1-vs-4 iterations only occasionally gave a p-value of ~0.03, 2-vs-4 was rarely < 0.05, and 2-vs-8+ was always > 0.05.

Whether I set the compiler to use 128-bit or 256-bit vectors, it always unrolls 4x, which in both cases is a multiple of the 64-byte cache line (significant or coincidence?).

The reason I'm thinking about cache lines is that I didn't expect unrolling to have any impact on a memory-bound program that sequentially reads and writes gigabytes of floats. Should there be a benefit to unrolling in this case? It's also possible there was no significant difference and my sample size wasn't large enough.
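One way to shrink the error bars is to collect many independent timing samples per variant and feed those into the t-test. A minimal harness for that (a sketch using std::chrono; the helper name and sample counts are my own, not from the original benchmarks):

```cpp
#include <chrono>
#include <vector>

// Time one pass of the plain add loop. Each call yields one runtime
// sample; the population of samples is what the t-test compares.
double time_add(const std::vector<float>& b, const std::vector<float>& c,
                std::vector<float>& a) {
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = b[i] + c[i];
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
// usage: call repeatedly on large arrays, keep every sample, and run
// the same loop for each unroll variant before comparing distributions
```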

I found this blog that says manually unrolling an array copy is faster for medium-sized arrays, while streaming is fastest for longer arrays. Their AvxAsyncPFCopier and AvxAsyncPFUnrollCopier functions seem to benefit from using whole cache lines as well as manual unrolling. The benchmark in the blog comes with source here.

#include <cstddef>
#include <immintrin.h>

int main() {
    // example of manually unrolling float arrays
    size_t bytes = sizeof(__m256) * 10;
    size_t alignment = sizeof(__m256);
    // 10 x 32-byte vectors
    __m256* a = (__m256*) _mm_malloc(bytes, alignment);
    __m256* b = (__m256*) _mm_malloc(bytes, alignment);
    __m256* c = (__m256*) _mm_malloc(bytes, alignment);

    // fill the inputs so the adds below don't read uninitialized memory
    for (int i = 0; i < 10; ++i) {
        b[i] = _mm256_set1_ps(1.0f);
        c[i] = _mm256_set1_ps(2.0f);
    }

    for (int i = 0; i < 10; i += 2) {
        // cache miss?
        // load 2 x 64-byte cache lines:
        //      2 x 32-byte vectors from b
        //      2 x 32-byte vectors from c
        a[i + 0] = _mm256_add_ps(b[i + 0], c[i + 0]);

        // cache hit?
        a[i + 1] = _mm256_add_ps(b[i + 1], c[i + 1]);

        // special bonus for consuming whole cache lines?
    }

    _mm_free(a);
    _mm_free(b);
    _mm_free(c);
}

Original source for 3 unique float arrays

for (int64_t i = 0; i < size; ++i) {
    a[i] = b[i] + c[i];
}

MSVC with AVX2 instructions

            a[i] = b[i] + c[i];
00007FF7E2522370  vmovups     ymm2,ymmword ptr [rax+rcx]  
00007FF7E2522375  vmovups     ymm1,ymmword ptr [rcx+rax-20h]  
00007FF7E252237B  vaddps      ymm1,ymm1,ymmword ptr [rax-20h]  
00007FF7E2522380  vmovups     ymmword ptr [rdx+rax-20h],ymm1  
00007FF7E2522386  vaddps      ymm1,ymm2,ymmword ptr [rax]  
00007FF7E252238A  vmovups     ymm2,ymmword ptr [rcx+rax+20h]  
00007FF7E2522390  vmovups     ymmword ptr [rdx+rax],ymm1  
00007FF7E2522395  vaddps      ymm1,ymm2,ymmword ptr [rax+20h]  
00007FF7E252239A  vmovups     ymm2,ymmword ptr [rcx+rax+40h]  
00007FF7E25223A0  vmovups     ymmword ptr [rdx+rax+20h],ymm1  
00007FF7E25223A6  vaddps      ymm1,ymm2,ymmword ptr [rax+40h]  
00007FF7E25223AB  add         r9,20h  
00007FF7E25223AF  vmovups     ymmword ptr [rdx+rax+40h],ymm1  
00007FF7E25223B5  lea         rax,[rax+80h]  
00007FF7E25223BC  cmp         r9,r10  
00007FF7E25223BF  jle         main$omp$2+0E0h (07FF7E2522370h) 

MSVC with default instructions

            a[i] = b[i] + c[i];
00007FF71ECB2372  movups      xmm0,xmmword ptr [rax-10h]  
00007FF71ECB2376  add         r9,10h  
00007FF71ECB237A  movups      xmm1,xmmword ptr [rcx+rax-10h]  
00007FF71ECB237F  movups      xmm2,xmmword ptr [rax+rcx]  
00007FF71ECB2383  addps       xmm1,xmm0  
00007FF71ECB2386  movups      xmm0,xmmword ptr [rax]  
00007FF71ECB2389  addps       xmm2,xmm0  
00007FF71ECB238C  movups      xmm0,xmmword ptr [rax+10h]  
00007FF71ECB2390  movups      xmmword ptr [rdx+rax-10h],xmm1  
00007FF71ECB2395  movups      xmm1,xmmword ptr [rcx+rax+10h]  
00007FF71ECB239A  movups      xmmword ptr [rdx+rax],xmm2  
00007FF71ECB239E  movups      xmm2,xmmword ptr [rcx+rax+20h]  
00007FF71ECB23A3  addps       xmm1,xmm0  
00007FF71ECB23A6  movups      xmm0,xmmword ptr [rax+20h]  
00007FF71ECB23AA  addps       xmm2,xmm0  
00007FF71ECB23AD  movups      xmmword ptr [rdx+rax+10h],xmm1  
00007FF71ECB23B2  movups      xmmword ptr [rdx+rax+20h],xmm2  
00007FF71ECB23B7  add         rax,40h  
00007FF71ECB23BB  cmp         r9,r10  
00007FF71ECB23BE  jle         main$omp$2+0D2h (07FF71ECB2372h)  

Comments (1)

屋檐 2025-02-15 14:05:00

I think the decision by compilers to unroll loops can be influenced by various factors, including instruction pipelining, instruction-level parallelism, and memory access patterns. Unrolling can expose more opportunities for the compiler to optimize instruction scheduling and reduce loop overhead, potentially improving performance.

In your case, since you are dealing with memory-bound operations, the main bottleneck is probably memory access rather than computation. Unrolling could still help by increasing memory prefetching opportunities and reducing per-iteration loop overhead.
