How to calculate a single-vector dot product using SSE intrinsic functions in C
I am trying to multiply two vectors together, where each element of one vector is multiplied by the element at the same index in the other vector. I then want to sum all the elements of the resulting vector to obtain one number. For instance, the calculation would look like this for the vectors {1,2,3,4} and {5,6,7,8}:
1*5 + 2*6 + 3*7 + 4*8
Essentially, I am taking the dot product of the two vectors. I know there is an SSE command to do this, but the command doesn't have an intrinsic function associated with it. At this point, I don't want to write inline assembly in my C code, so I want to use only intrinsic functions. This seems like a common calculation, so I am surprised that I couldn't find the answer on Google.
Note: I am optimizing for a specific microarchitecture which supports up to SSE 4.2.
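For reference, the plain-C version of the computation I want to speed up is just this (the function name is only for illustration):

```c
/* Scalar reference: multiply element-wise, then sum the products.
   For {1,2,3,4} and {5,6,7,8} this returns 70. */
float dot4_scalar(const float a[4], const float b[4])
{
    float sum = 0.0f;
    for (int i = 0; i < 4; i++)
        sum += a[i] * b[i];
    return sum;
}
```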
4 Answers
If you're doing a dot product of longer vectors, use multiply and regular _mm_add_ps (or FMA) inside the inner loop, and save the horizontal sum until the end. But if you are doing a dot product of just a single pair of SIMD vectors:

GCC (at least version 4.3) includes <smmintrin.h> with SSE4.1-level intrinsics, including the single- and double-precision dot products:
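For a single pair of vectors, that looks something like the following sketch (the wrapper name and the 0xFF immediate are illustrative choices; _mm_dp_pd is the double-precision counterpart):

```c
#include <smmintrin.h>   /* SSE4.1: _mm_dp_ps (dpps) */

/* Dot product of two 4-float vectors with one dpps instruction.
   Immediate 0xFF: multiply and sum all four lanes, broadcast the result
   to every output lane. */
static inline float dot4_dpps(__m128 a, __m128 b)
{
    __m128 dp = _mm_dp_ps(a, b, 0xFF);
    return _mm_cvtss_f32(dp);   /* take the low lane */
}
```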
On Intel mainstream CPUs (not Atom/Silvermont) these are somewhat faster than doing it manually with multiple instructions. But on AMD (including Ryzen), dpps is significantly slower (see Agner Fog's instruction tables).

As a fallback for older processors, you can use this algorithm to create the dot product of the vectors a and b:
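In intrinsics, that multiply step is a single instruction (r1 being the product vector the next step refers to):

```c
#include <xmmintrin.h>   /* SSE: __m128, _mm_mul_ps */

/* Element-wise products of a and b; r1 still needs a horizontal sum. */
__m128 r1 = _mm_mul_ps(a, b);   /* {a0*b0, a1*b1, a2*b2, a3*b3} */
```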
Then horizontal-sum r1 using Fastest way to do horizontal float vector sum on x86 (see there for a commented version of this, and why it's faster).
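A sketch of the SSE3 variant of that horizontal sum, following the movshdup/movhlps technique the linked answer describes (the helper name is illustrative):

```c
#include <pmmintrin.h>   /* SSE3: _mm_movehdup_ps */

/* Horizontal sum of the 4 floats in v, avoiding shuffles with immediates. */
static inline float hsum_ps_sse3(__m128 v)
{
    __m128 shuf = _mm_movehdup_ps(v);         /* {v1, v1, v3, v3} */
    __m128 sums = _mm_add_ps(v, shuf);        /* lane0 = v0+v1, lane2 = v2+v3 */
    shuf        = _mm_movehl_ps(shuf, sums);  /* bring v2+v3 down to lane0 */
    sums        = _mm_add_ss(sums, shuf);     /* lane0 = v0+v1+v2+v3 */
    return _mm_cvtss_f32(sums);               /* hsum_ps_sse3(r1) is the dot product */
}
```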
A slow alternative costs 2 shuffles per hadd, which will easily bottleneck on shuffle throughput, especially on Intel CPUs.
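For comparison, that slower haddps version would look roughly like this (illustrative sketch; on Intel each _mm_hadd_ps costs 2 shuffles plus an add):

```c
#include <pmmintrin.h>   /* SSE3: _mm_hadd_ps (haddps) */

/* Dot product via two horizontal adds: simple, but shuffle-heavy. */
static inline float dot4_hadd(__m128 a, __m128 b)
{
    __m128 p = _mm_mul_ps(a, b);   /* {a0*b0, a1*b1, a2*b2, a3*b3} */
    p = _mm_hadd_ps(p, p);         /* {p0+p1, p2+p3, p0+p1, p2+p3} */
    p = _mm_hadd_ps(p, p);         /* all lanes = p0+p1+p2+p3 */
    return _mm_cvtss_f32(p);
}
```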
I'd say the fastest SSE method would be:
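Presumably something along these lines (a sketch combining the packed multiply with the horizontal-sum technique from the link below; the function and variable names are illustrative):

```c
#include <pmmintrin.h>   /* SSE3 */

static inline float CalcDotProductSse(__m128 x, __m128 y)
{
    __m128 mulRes  = _mm_mul_ps(x, y);            /* element-wise products */
    __m128 shufReg = _mm_movehdup_ps(mulRes);     /* duplicate odd lanes */
    __m128 sumsReg = _mm_add_ps(mulRes, shufReg);
    shufReg = _mm_movehl_ps(shufReg, sumsReg);    /* high pair down to lane0 */
    sumsReg = _mm_add_ss(sumsReg, shufReg);
    return _mm_cvtss_f32(sumsReg);                /* dot product in the low lane */
}
```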
I followed Fastest Way to Do Horizontal Float Vector Sum On x86.
I wrote this and compiled it with gcc -O3 -S -ftree-vectorize -ftree-vectorizer-verbose=2 sse.c, and GCC 4.3.0 auto-vectorized it:
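A loop of roughly this shape is the kind of thing the tree vectorizer handles (a hypothetical reconstruction, not the exact original; the __restrict__ qualifiers and the output array o match what the paragraph below describes):

```c
/* Hypothetical reconstruction: element-wise products stored to o, with
   __restrict__ promising the compiler that o does not alias a or b.
   The dot product is then a sum over o (scalar, or one of the horizontal
   sums shown above). */
void vec_mul(const float *__restrict__ a, const float *__restrict__ b,
             float *__restrict__ o, int n)
{
    for (int i = 0; i < n; i++)
        o[i] = a[i] * b[i];
}
```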
However, it would only do that if I used a loop with enough iterations; otherwise the verbose output would clarify that vectorization was unprofitable or the loop was too small. Without the __restrict__ keywords it has to generate separate, non-vectorized versions to deal with cases where the output o may point into one of the inputs. I would paste the instructions as an example, but since part of the vectorization unrolled the loop, it's not very readable.
There is an article by Intel here which touches on dot-product implementations.