Is there a special benefit to consuming whole cache lines between iterations of a loop?
My program adds float arrays and is unrolled 4x when compiled with maximum optimizations by MSVC and G++. I didn't understand why both compilers chose to unroll 4x, so I did some testing: a t-test on runtimes for manually unrolling 1-vs-2 or 1-vs-4 iterations only occasionally gave a p-value of ~0.03, 2-vs-4 was rarely < 0.05, and 2-vs-8+ was always > 0.05.
If I set the compiler to use 128-bit vectors or 256-bit vectors it always unrolled 4x, which works out to a multiple of the 64-byte cache line (significant or coincidence?).
The reason I'm thinking about cache lines is that I didn't expect unrolling to have any impact on a memory-bound program that sequentially reads and writes gigabytes of floats. Should there be a benefit to unrolling in this case? It's also possible there was no significant difference and my sample size wasn't large enough.
I found this blog that says manually unrolling an array copy is faster for medium-sized arrays, while streaming is fastest for longer arrays. Their AvxAsyncPFCopier and AvxAsyncPFUnrollCopier functions seem to benefit from using whole cache lines as well as manual unrolling. The benchmark is in the blog, with source here.
#include <iostream>
#include <immintrin.h>

int main() {
    // example of manually unrolling float arrays
    size_t bytes = sizeof(__m256) * 10;
    size_t alignment = sizeof(__m256);
    // 10 x 32-byte vectors
    __m256* a = (__m256*) _mm_malloc(bytes, alignment);
    __m256* b = (__m256*) _mm_malloc(bytes, alignment);
    __m256* c = (__m256*) _mm_malloc(bytes, alignment);
    // give the inputs defined values so the adds below don't read uninitialized memory
    for (int i = 0; i < 10; ++i) {
        b[i] = _mm256_set1_ps(1.0f);
        c[i] = _mm256_set1_ps(2.0f);
    }
    for (int i = 0; i < 10; i += 2) {
        // cache miss?
        // load 2 x 64-byte cache lines:
        //   2 x 32-byte vectors from b
        //   2 x 32-byte vectors from c
        a[i + 0] = _mm256_add_ps(b[i + 0], c[i + 0]);
        // cache hit?
        a[i + 1] = _mm256_add_ps(b[i + 1], c[i + 1]);
        // special bonus for consuming whole cache lines?
    }
    _mm_free(a);
    _mm_free(b);
    _mm_free(c);
}
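For comparison, a manual 4x unroll at the intrinsics level, roughly what the compilers emit with AVX enabled, might look like the sketch below. The function name is mine and it assumes the element count is a multiple of 4 and the pointers are 32-byte aligned; with 32-byte __m256 vectors each iteration consumes two full 64-byte lines from b and two from c, while the same 4x unroll with 16-byte __m128 vectors would consume exactly one line per array.

#include <immintrin.h>
#include <cstddef>

// Sketch only: the add loop unrolled 4x, similar to what MSVC/G++ generate.
// n is the number of __m256 elements and is assumed to be a multiple of 4.
void add_unrolled_4x(__m256* a, const __m256* b, const __m256* c, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        a[i + 0] = _mm256_add_ps(b[i + 0], c[i + 0]);
        a[i + 1] = _mm256_add_ps(b[i + 1], c[i + 1]);
        a[i + 2] = _mm256_add_ps(b[i + 2], c[i + 2]);
        a[i + 3] = _mm256_add_ps(b[i + 3], c[i + 3]);
    }
}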
Original source for 3 unique float arrays
for (int64_t i = 0; i < size; ++i) {
    a[i] = b[i] + c[i];
}
MSVC with AVX2 instructions
a[i] = b[i] + c[i];
00007FF7E2522370 vmovups ymm2,ymmword ptr [rax+rcx]
00007FF7E2522375 vmovups ymm1,ymmword ptr [rcx+rax-20h]
00007FF7E252237B vaddps ymm1,ymm1,ymmword ptr [rax-20h]
00007FF7E2522380 vmovups ymmword ptr [rdx+rax-20h],ymm1
00007FF7E2522386 vaddps ymm1,ymm2,ymmword ptr [rax]
00007FF7E252238A vmovups ymm2,ymmword ptr [rcx+rax+20h]
00007FF7E2522390 vmovups ymmword ptr [rdx+rax],ymm1
00007FF7E2522395 vaddps ymm1,ymm2,ymmword ptr [rax+20h]
00007FF7E252239A vmovups ymm2,ymmword ptr [rcx+rax+40h]
00007FF7E25223A0 vmovups ymmword ptr [rdx+rax+20h],ymm1
00007FF7E25223A6 vaddps ymm1,ymm2,ymmword ptr [rax+40h]
00007FF7E25223AB add r9,20h
00007FF7E25223AF vmovups ymmword ptr [rdx+rax+40h],ymm1
00007FF7E25223B5 lea rax,[rax+80h]
00007FF7E25223BC cmp r9,r10
00007FF7E25223BF jle main$omp$2+0E0h (07FF7E2522370h)
MSVC with default instructions
a[i] = b[i] + c[i];
00007FF71ECB2372 movups xmm0,xmmword ptr [rax-10h]
00007FF71ECB2376 add r9,10h
00007FF71ECB237A movups xmm1,xmmword ptr [rcx+rax-10h]
00007FF71ECB237F movups xmm2,xmmword ptr [rax+rcx]
00007FF71ECB2383 addps xmm1,xmm0
00007FF71ECB2386 movups xmm0,xmmword ptr [rax]
00007FF71ECB2389 addps xmm2,xmm0
00007FF71ECB238C movups xmm0,xmmword ptr [rax+10h]
00007FF71ECB2390 movups xmmword ptr [rdx+rax-10h],xmm1
00007FF71ECB2395 movups xmm1,xmmword ptr [rcx+rax+10h]
00007FF71ECB239A movups xmmword ptr [rdx+rax],xmm2
00007FF71ECB239E movups xmm2,xmmword ptr [rcx+rax+20h]
00007FF71ECB23A3 addps xmm1,xmm0
00007FF71ECB23A6 movups xmm0,xmmword ptr [rax+20h]
00007FF71ECB23AA addps xmm2,xmm0
00007FF71ECB23AD movups xmmword ptr [rdx+rax+10h],xmm1
00007FF71ECB23B2 movups xmmword ptr [rdx+rax+20h],xmm2
00007FF71ECB23B7 add rax,40h
00007FF71ECB23BB cmp r9,r10
00007FF71ECB23BE jle main$omp$2+0D2h (07FF71ECB2372h)
Comments (1)
I think a compiler's decision to unroll a loop can be influenced by various factors, including instruction pipelining, instruction-level parallelism, and memory access patterns. Unrolling exposes more opportunities for the compiler to optimize instruction scheduling and reduces per-iteration loop overhead (the increment, compare, and branch), potentially improving performance.
In your case, since you are dealing with memory-bound operations, the main bottleneck is probably memory access rather than computation. Unrolling could still help by increasing memory prefetching opportunities and by reducing that loop overhead.
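To make that concrete, below is a minimal sketch of an unrolled loop that adds explicit software prefetch and non-temporal (streaming) stores. Everything here is an assumption for illustration: the function name, the prefetch distance of 8 vectors, and the use of _mm256_stream_ps are not taken from the original program or the blog, and the distance would need tuning on real hardware.

#include <immintrin.h>
#include <cstddef>

// Sketch only: each iteration consumes one whole 64-byte line from b and c,
// prefetches lines a fixed distance ahead, and writes the result with
// streaming stores so the destination lines are not read into the cache.
// Requires 32-byte-aligned pointers and assumes n is even.
void add_prefetch_stream(__m256* a, const __m256* b, const __m256* c, size_t n) {
    const size_t dist = 8; // prefetch distance in __m256 elements (a guess)
    for (size_t i = 0; i < n; i += 2) {
        if (i + dist < n) {
            _mm_prefetch(reinterpret_cast<const char*>(b + i + dist), _MM_HINT_T0);
            _mm_prefetch(reinterpret_cast<const char*>(c + i + dist), _MM_HINT_T0);
        }
        _mm256_stream_ps(reinterpret_cast<float*>(a + i + 0), _mm256_add_ps(b[i + 0], c[i + 0]));
        _mm256_stream_ps(reinterpret_cast<float*>(a + i + 1), _mm256_add_ps(b[i + 1], c[i + 1]));
    }
    _mm_sfence(); // order the streaming stores before the function returns
}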