Is it possible to vectorize myNum += a[b[i]] * c[i]; on x86_64?

Posted 2024-08-22 20:48:56

What intrinsics would I use to vectorize the following (if it's even possible to vectorize) on x86_64?

double myNum = 0;
for(int i=0;i<n;i++){
    myNum += a[b[i]] * c[i]; //b[i] = int, a[b[i]] = double, c[i] = double
}

5 Answers

迟到的我 2024-08-29 20:48:56

Here's my go at it, fully optimized and tested:

#include <emmintrin.h>

__m128d sum = _mm_setzero_pd();
for(int i = 0; i + 1 < n; i += 2) {   // stop one pair early so the i+1 accesses stay in bounds when n is odd
    sum = _mm_add_pd(sum, _mm_mul_pd(
        _mm_loadu_pd(c + i),
        _mm_setr_pd(a[b[i]], a[b[i+1]])
    ));
}

if(n & 1) {   // fold in the last element for odd n
    sum = _mm_add_pd(sum, _mm_set_sd(a[b[n-1]] * c[n-1]));
}

// horizontal add: high lane + low lane, then extract the low double
double finalSum = _mm_cvtsd_f64(_mm_add_pd(
    sum, _mm_shuffle_pd(sum, sum, _MM_SHUFFLE2(0, 1))
));

This produces very beautiful assembly code using gcc -O2 -msse2 (4.4.1).

As you can tell, an even n will make this loop run faster, as will an aligned c. If you can align c, change _mm_loadu_pd to _mm_load_pd for even faster execution times.

扭转时空 2024-08-29 20:48:56

I would start by unrolling the loop. Something like

double myNum1 = 0, myNum2 = 0;
for(int i = 0; i + 1 < n; i += 2)   // stop early so the i+1 accesses stay in bounds
{
    myNum1 += a[b[ i ]] * c[ i ];
    myNum2 += a[b[i+1]] * c[i+1];
}
if(n & 1)                           // handle the leftover element when n isn't a multiple of 2
    myNum1 += a[b[n-1]] * c[n-1];
double myNum = myNum1 + myNum2;

Hopefully that allows the compiler to interleave the loads with the arithmetic; profile and look at the assembly to see if there's an improvement. Ideally the compiler will generate SSE instructions, but I'm not sure whether that happens in practice.

Unrolling further might let you do this:

__m128d sum0, sum1;
// ...initialize to zero...
for(int i = 0; i + 3 < n; i += 4)   // stop early so the i+1..i+3 accesses stay in bounds
{
    double temp0 = a[b[ i ]] * c[ i ];
    double temp1 = a[b[i+1]] * c[i+1];
    double temp2 = a[b[i+2]] * c[i+2];
    double temp3 = a[b[i+3]] * c[i+3];
    __m128d pair0 = _mm_set_pd(temp0, temp1);
    __m128d pair1 = _mm_set_pd(temp2, temp3);
    sum0 = _mm_add_pd(sum0, pair0);
    sum1 = _mm_add_pd(sum1, pair1);
}
// ...extra code to handle the remainder when n isn't a multiple of 4...
// ...add sum0 and sum1, then add the result's components...

(apologies for the pseudocode at the start and end, I figure the important part was the loop). I don't know for sure whether that will be faster; it depends on the various latencies and how well the compiler can rearrange everything. Make sure you profile before and after to see whether there was an actual improvement.
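
A minimal sketch (not part of the original answer) of the glue left as pseudocode above, namely the zero-initialization, the scalar remainder, and the final horizontal add, assuming SSE2 and the same a, b, c, n as before:

#include <emmintrin.h>

__m128d sum0 = _mm_setzero_pd();
__m128d sum1 = _mm_setzero_pd();

/* ... the unrolled-by-4 loop from above goes here ... */

// Remainder: fold any leftover 1-3 elements in scalar.
double tail = 0.0;
for(int i = n & ~3; i < n; i++)
    tail += a[b[i]] * c[i];

// Combine the two accumulators, then add the high and low lanes.
__m128d sum = _mm_add_pd(sum0, sum1);
double myNum = _mm_cvtsd_f64(_mm_add_sd(sum, _mm_unpackhi_pd(sum, sum))) + tail;

This is only the glue code; the interesting part is still the loop itself.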

Hope that helps.

北风几吹夏 2024-08-29 20:48:56

Intel processors can issue two floating point operations per cycle but only one load, so memory accesses are the tightest constraint. With that in mind, I aimed first to use packed loads to reduce the number of load instructions, and used packed arithmetic just because it was convenient. I've since realized that saturating memory bandwidth may be the biggest issue, and all the messing around with SSE instructions might have been premature optimization if the point was to make the code go fast rather than to learn to vectorize.

SSE

The fewest possible loads, with no assumptions about the indices in b, requires unrolling the loop four times: one 128-bit load gets four indices from b, two 128-bit loads each get a pair of adjacent doubles from c, and gathering from a requires four independent 64-bit loads. That's a floor of 7 cycles per four iterations for serial code (enough to saturate my memory bandwidth if accesses to a don't cache nicely). I've left out some annoying things like handling an iteration count that isn't a multiple of 4.

entry: ; (rdi,rsi,rdx,rcx) are (n,a,b,c)
  xorpd xmm0, xmm0
  xor r8, r8
loop:
  movdqa xmm1, [rdx+4*r8]
  movapd xmm2, [rcx+8*r8]
  movapd xmm3, [rcx+8*r8+8]
  movd   r9,   xmm1
  movq   r10,  xmm1
  movsd  xmm4, [rsi+8*r9]
  shr    r10,  32
  movhpd xmm4, [rsi+8*r10]
  punpckhqdq xmm1, xmm1
  movd   r9,   xmm1
  movq   r10,  xmm1
  movsd  xmm5, [rsi+8*r9]
  shr    r10,  32
  movhpd xmm5, [rsi+8*r10]
  add    r8,   4
  cmp    r8,   rdi
  mulpd  xmm2, xmm4
  mulpd  xmm3, xmm5
  addpd  xmm0, xmm2
  addpd  xmm0, xmm3
  jl loop

Getting the indices out is the most complicated part. movdqa loads 128 bits of integer data from a 16-byte-aligned address (Nehalem has latency penalties for mixing the "integer" and "float" SSE instructions). punpckhqdq moves the high 64 bits to the low 64 bits, but in the integer domain, unlike the more simply named movhlps. The 32-bit shifts are done in general-purpose registers. movhpd loads one double into the upper half of an xmm register without disturbing the lower half; this is used to load elements of a directly into packed registers.
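
For readers more comfortable with intrinsics than raw assembly, roughly the same gather idea might look like the sketch below. This is not the author's tested code; the function name is made up, and it assumes b and c are 16-byte aligned (matching the movdqa/movapd above) and that n is a multiple of 4.

#include <stddef.h>
#include <emmintrin.h>

/* Sketch of the same per-iteration work as the assembly loop above. */
static double dot_indexed_sse2(const double *a, const int *b, const double *c, size_t n)
{
    __m128d sum = _mm_setzero_pd();
    for (size_t i = 0; i < n; i += 4) {
        __m128i idx = _mm_load_si128((const __m128i *)(b + i)); /* movdqa: four indices  */
        __m128d c01 = _mm_load_pd(c + i);                       /* movapd                */
        __m128d c23 = _mm_load_pd(c + i + 2);                   /* movapd                */

        int i0 = _mm_cvtsi128_si32(idx);                        /* movd                  */
        int i1 = _mm_cvtsi128_si32(_mm_srli_si128(idx, 4));     /* movq + shr in the asm */
        int i2 = _mm_cvtsi128_si32(_mm_srli_si128(idx, 8));     /* punpckhqdq + movd     */
        int i3 = _mm_cvtsi128_si32(_mm_srli_si128(idx, 12));

        __m128d a01 = _mm_loadh_pd(_mm_load_sd(a + i0), a + i1); /* movsd + movhpd       */
        __m128d a23 = _mm_loadh_pd(_mm_load_sd(a + i2), a + i3);

        sum = _mm_add_pd(sum, _mm_mul_pd(c01, a01));             /* mulpd + addpd        */
        sum = _mm_add_pd(sum, _mm_mul_pd(c23, a23));
    }
    /* Add the two lanes of the accumulator. */
    return _mm_cvtsd_f64(_mm_add_sd(sum, _mm_unpackhi_pd(sum, sum)));
}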

This code is distinctly faster than the code above, which is in turn faster than the simple code, on every access pattern except the simple case b[i] = i, where the naive loop is actually fastest. I also tried a few things, like a function around SUM(A(B(:)),C(:)) in Fortran, which ended up basically equivalent to the simple loop.

I tested on a Q6600 (65 nm Core 2 at 2.4 GHz) with 4 GB of DDR2-667 memory, in 4 modules.
Testing memory bandwidth gives about 5333 MB/s, so it seems like I'm only seeing a single channel. I'm compiling with Debian's gcc 4.3.2-1.1, using -O3 -ffast-math -msse2 -ftree-vectorize -std=gnu99.

For testing I'm letting n be one million, initializing the arrays so a[b[i]] and c[i] both equal 1.0/(i+1), with a few different patterns of indices. One allocates a with a million elements and sets b to a random permutation, another allocates a with 10M elements and uses every 10th, and the last allocates a with 10M elements and sets up b[i+1] by adding a random number from 1 to 9 to b[i]. I'm timing how long one call takes with gettimeofday, clearing the caches by calling clflush over the arrays, and measuring 1000 trials of each function. I plotted smoothed runtime distributions using some code from the guts of criterion (in particular, the kernel density estimator in the statistics package).
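
As a rough illustration of that measurement procedure (not the author's actual harness; the names and the kernel signature here are made up), flushing the arrays with _mm_clflush and timing one call with gettimeofday might look like this:

#include <stddef.h>
#include <sys/time.h>
#include <emmintrin.h>   /* _mm_clflush is declared here with SSE2 */

/* Evict a buffer from the caches, one 64-byte line at a time. */
static void flush_range(const void *p, size_t bytes)
{
    for (size_t off = 0; off < bytes; off += 64)
        _mm_clflush((const char *)p + off);
}

/* Time a single cold call of the kernel, in seconds. */
static double time_once(double (*kernel)(const double *, const int *, const double *, int),
                        const double *a, size_t a_len,
                        const int *b, const double *c, int n)
{
    flush_range(a, a_len * sizeof *a);
    flush_range(b, (size_t)n * sizeof *b);
    flush_range(c, (size_t)n * sizeof *c);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    volatile double sink = kernel(a, b, c, n);   /* keep the call from being optimized out */
    gettimeofday(&t1, NULL);
    (void)sink;
    return (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
}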

Bandwidth

Now, for the actually important note about bandwidth: 5333 MB/s at a 2.4 GHz clock is just over two bytes per cycle. My data is long enough that nothing should stay cached, and dividing the (16 + 2*16 + 4*64) bytes loaded per unrolled iteration (assuming everything misses) by the measured time per iteration gives almost exactly the ~5333 MB/s bandwidth my system has. It should be pretty easy to saturate that bandwidth without SSE. Even assuming a were completely cached, just reading b and c for one iteration moves 12 bytes of data, and the naive loop can start a new iteration every third cycle with pipelining.

Anything less than complete caching of a makes arithmetic and instruction count even less of a bottleneck. I wouldn't be surprised if most of the speedup in my code comes from issuing fewer loads to b and c, leaving more room to track and speculate past cache misses on a.

Wider hardware might make more of a difference. A Nehalem system running three channels of DDR3-1333 would need to move 3*10667/2.66 = 12.6 bytes per cycle to saturate memory bandwidth. That would be impossible for a single thread if a fits in cache, but at 64 bytes per cache line, misses on the vector add up quickly: just one of the four loads in my loop missing in cache brings the average required bandwidth up to 16 bytes/cycle.

老街孤人 2024-08-29 20:48:56

Short answer: no. Long answer: yes, but not efficiently. You will incur a penalty for doing unaligned loads, which will negate any kind of benefit. Unless you can guarantee that successive b[i] indices are aligned, you will most likely have worse performance after vectorization.

If you know beforehand what the indices are, your best bet is to unroll and specify explicit indices. I did something similar using template specialization and code generation. If you're interested, I can share.

To answer your comment, you basically have to concentrate on the a array. The easiest thing to try right away is to block your loop by a factor of two, load the a values for the low and high lanes separately, and then use the _mm_*_pd intrinsics as usual. Pseudocode:

__m128d va, result = _mm_setzero_pd();
for(int i = 0; i + 1 < n; i += 2) {          // handle a leftover odd element separately
    // gather the two doubles from A by hand (low lane, then high lane)
    va = _mm_setr_pd(A[B[i]], A[B[i+1]]);
    // you may also load B using packed integer instructions
    result = _mm_add_pd(result, _mm_mul_pd(va, _mm_loadu_pd(&C[i])));
}

I don't remember the exact function names, so you may want to double-check.
Also, use the restrict keyword on the pointers if you know there can be no aliasing issues.
That will allow the compiler to be much more aggressive.
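
For example, a minimal sketch of what a restrict-qualified signature might look like (the function name is made up for illustration):

/* Promising the compiler that a, b and c never alias each other. */
double dot_indexed(const double *restrict a, const int *restrict b,
                   const double *restrict c, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[b[i]] * c[i];
    return sum;
}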

抹茶夏天i‖ 2024-08-29 20:48:56

This is not going to vectorize as it is, because of the double indirection of the array indices. Since you're working with doubles there is little or nothing to be gained from SSE, particularly as most modern CPUs have 2 FPUs anyway.
