Is it possible to vectorize myNum += a[b[i]] * c[i]; on x86_64?
What intrinsics would I use to vectorize the following (if it's even possible to vectorize) on x86_64?

double myNum = 0;
for (int i = 0; i < n; i++) {
    myNum += a[b[i]] * c[i];  // b[i] = int, a[b[i]] = double, c[i] = double
}
5 Answers
Here's my go at it, fully optimized and tested:
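A sketch of what that loop can look like (the function name and the odd-n handling here are illustrative; the key points are _mm_set_pd to gather the two a elements and _mm_loadu_pd for the adjacent pair from c):

#include <emmintrin.h>

/* sums a[b[i]] * c[i] two elements at a time with SSE2 */
double dot_indexed(const double *a, const int *b, const double *c, int n)
{
    __m128d sum = _mm_setzero_pd();
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        /* gather the two a elements with scalar indexing */
        __m128d av = _mm_set_pd(a[b[i + 1]], a[b[i]]);
        /* unaligned pair load from c; use _mm_load_pd if c is 16-byte aligned */
        __m128d cv = _mm_loadu_pd(c + i);
        sum = _mm_add_pd(sum, _mm_mul_pd(av, cv));
    }
    /* horizontal sum of the two lanes */
    double pair[2];
    _mm_storeu_pd(pair, sum);
    double myNum = pair[0] + pair[1];
    if (i < n)  /* odd n: finish the last element in scalar code */
        myNum += a[b[i]] * c[i];
    return myNum;
}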
This produces very beautiful assembly code using gcc -O2 -msse2 (4.4.1). As you can tell, having an even n will make this loop go faster, as will an aligned c. If you can align c, change _mm_loadu_pd to _mm_load_pd for an even faster execution time.
I would start by unrolling the loop. Something like
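the following (a sketch: two scalar accumulators break the dependency chain, with a tail for odd n):

double myNum1 = 0, myNum2 = 0;
int i;
for (i = 0; i + 1 < n; i += 2) {
    myNum1 += a[b[i]]     * c[i];
    myNum2 += a[b[i + 1]] * c[i + 1];
}
if (i < n)  /* tail for odd n */
    myNum1 += a[b[i]] * c[i];
myNum = myNum1 + myNum2;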
Hopefully that allows the compiler to interleave the loads with the arithmetic; profile and look at the assembly to see if there's an improvement. Ideally the compiler will generate SSE instructions, but I'm not sure that happens in practice.
Unrolling further might let you do this:
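Something like this sketch (SSE2; the gathers still go through scalar indexing, since SSE2 has no gather instruction, and the tail handling is illustrative):

#include <emmintrin.h>

__m128d sum0 = _mm_setzero_pd();
__m128d sum1 = _mm_setzero_pd();
int i;
for (i = 0; i + 3 < n; i += 4) {
    /* gather four elements of a into two packed registers */
    __m128d a01 = _mm_set_pd(a[b[i + 1]], a[b[i]]);
    __m128d a23 = _mm_set_pd(a[b[i + 3]], a[b[i + 2]]);
    /* two packed multiply-adds into independent accumulators */
    sum0 = _mm_add_pd(sum0, _mm_mul_pd(a01, _mm_loadu_pd(c + i)));
    sum1 = _mm_add_pd(sum1, _mm_mul_pd(a23, _mm_loadu_pd(c + i + 2)));
}
/* combine accumulators and reduce to a scalar */
__m128d sum = _mm_add_pd(sum0, sum1);
double pair[2];
_mm_storeu_pd(pair, sum);
myNum = pair[0] + pair[1];
for (; i < n; i++)  /* scalar tail when n is not a multiple of 4 */
    myNum += a[b[i]] * c[i];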
(apologies for the pseudocode at the start and end, I figure the important part was the loop). I don't know for sure whether that will be faster; it depends on the various latencies and how well the compiler can rearrange everything. Make sure you profile before and after to see whether there was an actual improvement.
Hope that helps.
Intel processors can issue two floating point operations but only one load per cycle, so memory accesses are the tightest constraint. With that in mind, I aimed first to use packed loads to reduce the number of load instructions, and used packed arithmetic just because it was convenient. I've since realized that saturating memory bandwidth may be the biggest issue, and all the messing around with SSE instructions might have been premature optimization if the point was to make the code go fast rather than to learn to vectorize.
SSE
The fewest possible loads with no assumption on the indices in b requires unrolling the loop four times. One 128-bit load gets four indices from b, two 128-bit loads each get a pair of adjacent doubles from c, and gathering a requires independent 64-bit loads. That's a floor of 7 cycles per four iterations for serial code (enough to saturate my memory bandwidth if access to a doesn't cache nicely). I've left out some annoying things like handling a number of iterations which is not a multiple of 4. Getting the indices out is the most complicated part.
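A sketch of the loop that scheme produces (reconstructed around the instruction notes below; it assumes b is 16-byte aligned and n is a multiple of 4, and the function name is illustrative):

#include <emmintrin.h>
#include <stdint.h>

double dot_indexed_sse2(const double *a, const int32_t *b, const double *c, int n)
{
    __m128d sum0 = _mm_setzero_pd();
    __m128d sum1 = _mm_setzero_pd();
    for (int i = 0; i < n; i += 4) {
        __m128i vidx = _mm_load_si128((const __m128i *)(b + i)); /* movdqa: four indices */
        int64_t lo = _mm_cvtsi128_si64(vidx);                    /* indices 0 and 1 */
        __m128i vhi = _mm_unpackhi_epi64(vidx, vidx);            /* punpckhqdq */
        int64_t hi = _mm_cvtsi128_si64(vhi);                     /* indices 2 and 3 */
        /* 32-bit shifts in the general purpose registers pick the indices apart */
        __m128d a01 = _mm_load_sd(a + (int32_t)lo);
        a01 = _mm_loadh_pd(a01, a + (int32_t)(lo >> 32));        /* movhpd: upper half */
        __m128d a23 = _mm_load_sd(a + (int32_t)hi);
        a23 = _mm_loadh_pd(a23, a + (int32_t)(hi >> 32));
        sum0 = _mm_add_pd(sum0, _mm_mul_pd(a01, _mm_loadu_pd(c + i)));
        sum1 = _mm_add_pd(sum1, _mm_mul_pd(a23, _mm_loadu_pd(c + i + 2)));
    }
    /* combine accumulators and reduce to a scalar */
    __m128d sum = _mm_add_pd(sum0, sum1);
    double pair[2];
    _mm_storeu_pd(pair, sum);
    return pair[0] + pair[1];
}

That's 7 loads per four iterations: one for the indices, two for c, and four for the gathered elements of a.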
movdqa loads 128 bits of integer data from a 16-byte-aligned address (Nehalem has latency penalties for mixing the "integer" and "float" SSE instructions). punpckhqdq moves the high 64 bits to the low 64 bits, but in integer mode, unlike the more simply named movhlps. The 32-bit shifts are done in the general purpose registers. movhpd loads one double into the upper part of an xmm register without disturbing the lower part - this is used to load the elements of a directly into packed registers.

This code is distinctly faster than the code above, which is in turn faster than the simple code, on every access pattern but the simple case b[i] = i, where the naive loop is actually fastest. I also tried a few things like a function around SUM(A(B(:)) * C(:)) in Fortran, which ended up basically equivalent to the simple loop.

I tested on a Q6600 (65 nm Core 2 at 2.4 GHz) with 4GB of DDR2-667 memory, in 4 modules.
Testing memory bandwidth gives about 5333 MB/s, so it seems like I'm only seeing a single channel. I'm compiling with Debian's gcc 4.3.2-1.1, with -O3 -ffast-math -msse2 -ftree-vectorize -std=gnu99.
For testing I'm letting n be one million, initializing the arrays so that a[b[i]] and c[i] both equal 1.0/(i+1), with a few different patterns of indices. One allocates a with a million elements and sets b to a random permutation, another allocates a with 10M elements and uses every 10th, and the last allocates a with 10M elements and sets up b[i+1] by adding a random number from 1 to 9 to b[i]. I'm timing how long one call takes with gettimeofday, clearing the caches by calling clflush over the arrays, and measuring 1000 trials of each function. I plotted smoothed runtime distributions using some code from the guts of criterion (in particular, the kernel density estimator in the statistics package).
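A sketch of that timing harness (the helper names are illustrative; clflush is issued over every 64-byte line of each array before a trial):

#include <stddef.h>
#include <sys/time.h>
#include <emmintrin.h>   /* _mm_clflush */

/* flush every cache line of a buffer before a trial */
static void flush_range(const void *p, size_t bytes)
{
    const char *q = (const char *)p;
    for (size_t off = 0; off < bytes; off += 64)
        _mm_clflush(q + off);
}

/* time one call in microseconds with gettimeofday */
static double time_once(double (*fn)(const double *, const int *, const double *, int),
                        const double *a, const int *b, const double *c, int n)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    volatile double r = fn(a, b, c, n);  /* volatile keeps the call from being optimized away */
    (void)r;
    gettimeofday(&t1, NULL);
    return (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
}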
Bandwidth

Now, for the actual important note about bandwidth. 5333 MB/s with a 2.4 GHz clock is just over two bytes per cycle. My data is long enough that nothing should be cacheable, and multiplying the runtime of my loop by the (16 + 2*16 + 4*64) bytes loaded per unrolled iteration if everything misses gives almost exactly the ~5333 MB/s bandwidth my system has. It should be pretty easy to saturate that bandwidth without SSE. Even assuming a were completely cached, just reading b and c for one iteration moves 12 bytes of data, and the naive loop can start a new iteration every third cycle with pipelining.

Assuming anything less than complete caching on a makes arithmetic and instruction count even less of a bottleneck. I wouldn't be surprised if most of the speedup in my code comes from issuing fewer loads to b and c, so that more space is free to track and speculate past cache misses on a.

Wider hardware might make more of a difference. A Nehalem system running three channels of DDR3-1333 would need to move 3*10667/2.66 = 12.6 bytes per cycle to saturate memory bandwidth. That would be impossible for a single thread if a fits in cache - but at 64 bytes a line, cache misses on the vector add up quickly - just one of the four loads in my loop missing in caches brings the average required bandwidth up to 16 bytes/cycle.
Short answer: no. Long answer: yes, but not efficiently. You will incur the penalty for doing non-aligned loads, which will negate any kind of benefit. Unless you can guarantee that successive b[i] indices refer to adjacent, aligned locations, you will most likely have worse performance after vectorization.

If you know beforehand what the indices are, your best bet is to unroll and specify explicit indices. I did something similar using template specialization and code generation. If you're interested, I can share.

To answer your comment, you basically have to concentrate on the a array. The easiest thing to try right away is to block your loop by a factor of two, load the low and high elements of a separately, and then use the mm*_pd intrinsics as usual. Pseudocode:
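Something along these lines (a sketch using SSE2 intrinsics from emmintrin.h; as noted below, double-check the exact names):

/* process two iterations per step: gather a low/high, then packed multiply-add */
__m128d sum = _mm_setzero_pd();
for (int i = 0; i < n; i += 2) {
    __m128d av = _mm_load_sd(a + b[i]);      /* low element */
    av = _mm_loadh_pd(av, a + b[i + 1]);     /* high element */
    sum = _mm_add_pd(sum, _mm_mul_pd(av, _mm_loadu_pd(c + i)));
}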
I do not remember the function names exactly; you may want to double-check.
Also, use the restrict keyword on the pointers if you know there can be no aliasing issues.
This will allow the compiler to be much more aggressive.
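For instance, with an illustrative signature:

double dot_indexed(const double *restrict a, const int *restrict b,
                   const double *restrict c, int n);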
This is not going to vectorize as it is, because of the double indirection of the array indices. Since you're working with doubles, there is little or nothing to be gained from SSE, particularly as most modern CPUs have two FPUs anyway.