How to compute a single-vector dot product using SSE intrinsics in C

Posted 2024-10-01 05:50:46

I am trying to multiply two vectors together where each element of one vector is multiplied by the element in the same index at the other vector. I then want to sum all the elements of the resulting vector to obtain one number. For instance, the calculation would look like this for the vectors {1,2,3,4} and {5,6,7,8}:

1*5 + 2*6 + 3*7 + 4*8

Essentially, I am taking the dot product of the two vectors. I know there is an SSE instruction that does this, but as far as I can tell it has no intrinsic function associated with it. I don't want to write inline assembly in my C code, so I want to use only intrinsics. This seems like a common calculation, so I am surprised that I couldn't find the answer on Google.

Note: I am optimizing for a specific microarchitecture which supports up to SSE 4.2.

残月升风 2024-10-08 05:50:46

If you're doing a dot-product of longer vectors, use multiply and regular _mm_add_ps (or FMA) inside the inner loop. Save the horizontal sum until the end.
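As a rough sketch of that advice, the loop below keeps four running partial sums in one SSE register and reduces them to a scalar only once, after the loop. The function name `dot_product_sse` and the requirement that `n` be a multiple of 4 are assumptions made for illustration. The final reduction reuses the shuffle-based horizontal sum shown later in this answer.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_mul_ps, _mm_add_ps, _mm_shuffle_ps, ... */

/* Hypothetical helper: dot product of two float arrays of length n.
   Assumes n is a multiple of 4; a fuller version would handle the tail. */
static float dot_product_sse(const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);

    __m128 acc = _mm_setzero_ps();             /* four partial sums */
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);       /* unaligned loads, to be safe */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }

    /* Single horizontal sum of the four partial sums, done after the loop. */
    __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(acc, shuf);
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```

Because the partial sums are accumulated per lane, the result can differ in the last bits from a strictly left-to-right scalar loop.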

But if you are doing a dot product of just a single pair of SIMD vectors:

GCC (at least version 4.3) includes <smmintrin.h> with SSE4.1 level intrinsics, including the single and double-precision dot products:

__m128  _mm_dp_ps (__m128 __X, __m128 __Y, const int __M);
__m128d _mm_dp_pd (__m128d __X, __m128d __Y, const int __M);

On Intel mainstream CPUs (not Atom/Silvermont) these are somewhat faster than doing it manually with multiple instructions.

But on AMD (including Ryzen), dpps is significantly slower. (See Agner Fog's instruction tables)
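A minimal sketch of how the single-precision intrinsic might be called for the four-element case in the question. The helper name `dot4_dpps` is made up for illustration. The immediate 0xF1 means: multiply all four lanes (high nibble 0xF) and write the sum to lane 0 only (low nibble 0x1). The `target` attribute is a GCC/Clang extension used here only so the file builds without -msse4.1; compiling with -msse4.1 and dropping the attribute should work as well.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Hypothetical helper: dot product of two 4-float arrays using dpps.
   The attribute below is GCC/Clang-specific (an assumption of this sketch). */
__attribute__((target("sse4.1")))
static float dot4_dpps(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    /* 0xF1: use all four products, store their sum in element 0. */
    __m128 dp = _mm_dp_ps(va, vb, 0xF1);
    return _mm_cvtss_f32(dp);
}
```

For the vectors {1,2,3,4} and {5,6,7,8} from the question, this returns 70.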

As a fallback for older processors, you can compute the dot product of the vectors a and b as follows. First multiply element-wise:

__m128 r1 = _mm_mul_ps(a, b);

and then horizontally sum r1 using the approach from the Stack Overflow answer "Fastest way to do horizontal float vector sum on x86" (see there for a commented version of this, and an explanation of why it's faster):

__m128 shuf   = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1));
__m128 sums   = _mm_add_ps(r1, shuf);
shuf          = _mm_movehl_ps(shuf, sums);
sums          = _mm_add_ss(sums, shuf);
float result =  _mm_cvtss_f32(sums);

A slow alternative costs 2 shuffles per hadd, which will easily bottleneck on shuffle throughput, especially on Intel CPUs.

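// Alternative to the sequence above; _mm_hadd_ps needs SSE3 (<pmmintrin.h>)
__m128 r2 = _mm_hadd_ps(r1, r1);
__m128 r3 = _mm_hadd_ps(r2, r2);
float result;
_mm_store_ss(&result, r3);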

梦途 2024-10-08 05:50:46

I'd say the fastest SSE method would be:

#include <pmmintrin.h>  // SSE3, needed for _mm_movehdup_ps

static inline float CalcDotProductSse(__m128 x, __m128 y) {
    __m128 mulRes, shufReg, sumsReg;
    mulRes = _mm_mul_ps(x, y);

    // Calculates the sum of SSE Register - https://stackoverflow.com/a/35270026/195787
    shufReg = _mm_movehdup_ps(mulRes);        // Broadcast elements 3,1 to 2,0
    sumsReg = _mm_add_ps(mulRes, shufReg);
    shufReg = _mm_movehl_ps(shufReg, sumsReg); // High Half -> Low Half
    sumsReg = _mm_add_ss(sumsReg, shufReg);
    return  _mm_cvtss_f32(sumsReg); // Result in the lower part of the SSE Register
}

This follows the Stack Overflow answer "Fastest way to do horizontal float vector sum on x86".

只是我以为 2024-10-08 05:50:46

I wrote this and compiled it with gcc -O3 -S -ftree-vectorize -ftree-vectorizer-verbose=2 sse.c:

void f(int * __restrict__ a, int * __restrict__ b, int * __restrict__ c, int * __restrict__ d,
       int * __restrict__ e, int * __restrict__ f, int * __restrict__ g, int * __restrict__ h,
       int * __restrict__ o)
{
    int i;

    for (i = 0; i < 8; ++i)
        o[i] = a[i]*e[i] + b[i]*f[i] + c[i]*g[i] + d[i]*h[i];
}

And GCC 4.3.0 auto-vectorized it:

sse.c:5: note: LOOP VECTORIZED.
sse.c:2: note: vectorized 1 loops in function.

However, it would only do that if I used a loop with enough iterations -- otherwise the verbose output would clarify that vectorization was unprofitable or the loop was too small. Without the __restrict__ keywords it has to generate separate, non-vectorized versions to deal with cases where the output o may point into one of the inputs.

I would paste the instructions as an example, but since part of the vectorization unrolled the loop it's not very readable.
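Since the question is about floating-point vectors, here is a hedged float variant of the same idea, using the standard C99 `restrict` qualifier instead of `__restrict__`. The names `dot4_batch` and `N_PAIRS` are invented for this sketch. The data is laid out as a structure of arrays: `out[i]` is the dot product of the i-th left-hand vector (x0[i], x1[i], x2[i], x3[i]) with the i-th right-hand vector (y0[i], y1[i], y2[i], y3[i]). Whether a particular compiler actually vectorizes this loop depends on its version and flags.

```c
#include <stddef.h>

#define N_PAIRS 8  /* hypothetical batch size, matching the loop bound above */

/* Computes N_PAIRS independent 4-element dot products.
   There is no reduction across iterations, so each iteration
   is independent of the others. */
void dot4_batch(const float *restrict x0, const float *restrict x1,
                const float *restrict x2, const float *restrict x3,
                const float *restrict y0, const float *restrict y1,
                const float *restrict y2, const float *restrict y3,
                float *restrict out)
{
    for (size_t i = 0; i < N_PAIRS; ++i)
        out[i] = x0[i]*y0[i] + x1[i]*y1[i] + x2[i]*y2[i] + x3[i]*y3[i];
}
```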
