How to take the absolute value of 2 doubles or 4 floats using the SSE instruction set? (Up to SSE4)

Posted on 2024-10-29 02:22:21

Here's the sample C code that I am trying to accelerate using SSE. The two arrays are 3072 elements long and hold doubles; I may drop down to float if I don't need the precision of doubles.

double sum = 0.0;

for(k = 0; k < 3072; k++) {
    sum += fabs(sima[k] - simb[k]);
}

double fp = (1.0 - (sum / (255.0 * 1024.0 * 3.0)));

Anyway, my current problem is how to do the fabs step in an SSE register, for doubles or floats, so that the whole calculation stays in SSE registers and remains fast, and so that I can parallelize all of the steps by partly unrolling this loop.

Here are some resources I've found: fabs() asm, or possibly this one on flipping the sign (SO). The weakness of the second one is that it would need a conditional check.

束缚ｍ 2024-11-05 02:22:21

I suggest using a bitwise AND with a mask. A positive value and its negative have the same representation except for the most significant bit, which is 0 for positive values and 1 for negative values (see the double-precision number format). You can use one of these:

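#include <emmintrin.h> // SSE2 (also provides the SSE single-precision intrinsics)

static inline __m128 abs_ps(__m128 x) {
    const __m128 sign_mask = _mm_set1_ps(-0.f); // -0.f has only the sign bit (bit 31) set
    return _mm_andnot_ps(sign_mask, x);         // ~sign_mask & x
}

static inline __m128d abs_pd(__m128d x) {
    const __m128d sign_mask = _mm_set1_pd(-0.); // -0. has only the sign bit (bit 63) set
    return _mm_andnot_pd(sign_mask, x);         // ~sign_mask & x
}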
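
As a quick illustration of the sign-mask idea, here is a minimal sketch that applies it to a whole array. The helper name `abs_array` and the even-length requirement are assumptions made for this example, not part of the answer above; only SSE2 intrinsics are used.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h> /* SSE2: __m128d and the _pd intrinsics */

/* Clear the sign bit of both lanes: (~sign_mask) & x. */
static inline __m128d abs_pd(__m128d x) {
    const __m128d sign_mask = _mm_set1_pd(-0.0); /* only bit 63 set per lane */
    return _mm_andnot_pd(sign_mask, x);
}

/* Hypothetical helper: out[i] = |in[i]| for an even element count n. */
static void abs_array(const double *in, double *out, size_t n) {
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        /* Unaligned load/store, so the arrays need no special alignment. */
        _mm_storeu_pd(out + i, abs_pd(_mm_loadu_pd(in + i)));
    }
}
```

Because the operation only clears one bit, it involves no comparison or branch, which is what the question was trying to avoid.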

It is also a good idea to unroll the loop to break the loop-carried dependency chain. Since this is a sum of nonnegative values, the order of summation is not important (reordering can change the result only by floating-point rounding):

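#include <pmmintrin.h> // SSE3, needed for _mm_hadd_pd

double norm(const double* sima, const double* simb) {
    __m128d sum1 = _mm_setzero_pd();
    __m128d sum2 = _mm_setzero_pd();

    // Two independent accumulators; each iteration consumes 4 doubles.
    // _mm_loadu_pd works for any alignment; use _mm_load_pd if the
    // arrays are known to be 16-byte aligned.
    for (int k = 0; k < 3072; k += 4) {
        __m128d d1 = _mm_sub_pd(_mm_loadu_pd(sima + k),     _mm_loadu_pd(simb + k));
        __m128d d2 = _mm_sub_pd(_mm_loadu_pd(sima + k + 2), _mm_loadu_pd(simb + k + 2));
        sum1 = _mm_add_pd(sum1, abs_pd(d1));
        sum2 = _mm_add_pd(sum2, abs_pd(d2));
    }

    __m128d sum = _mm_add_pd(sum1, sum2);
    __m128d hsum = _mm_hadd_pd(sum, sum); // both lanes now hold the total
    return _mm_cvtsd_f64(hsum);           // extract the low lane
}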

By unrolling and breaking the dependency (sum1 and sum2 are now independent), you let the processor execute the additions out of order. Because addition is pipelined on a modern CPU, it can start a new addition before the previous one has finished. Bitwise operations also run on a separate execution unit, so the CPU can perform them in the same cycle as an addition or subtraction. I suggest reading Agner Fog's optimization manuals.
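
To make the independent-accumulator idea concrete for arrays of arbitrary length, the following is a rough sketch that uses only SSE2. The name `sad_pd` and the scalar tail loop are assumptions added for this example; whether two accumulators are enough to hide the add latency depends on the CPU and is not measured here.

```c
#include <stddef.h>
#include <emmintrin.h> /* SSE2 */

/* Hypothetical name: sum of |a[i] - b[i]| over n doubles. */
static double sad_pd(const double *a, const double *b, size_t n) {
    const __m128d sign_mask = _mm_set1_pd(-0.0);
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    size_t k = 0;

    /* Two independent accumulators, four doubles per iteration. */
    for (; k + 4 <= n; k += 4) {
        __m128d d0 = _mm_sub_pd(_mm_loadu_pd(a + k),     _mm_loadu_pd(b + k));
        __m128d d1 = _mm_sub_pd(_mm_loadu_pd(a + k + 2), _mm_loadu_pd(b + k + 2));
        acc0 = _mm_add_pd(acc0, _mm_andnot_pd(sign_mask, d0));
        acc1 = _mm_add_pd(acc1, _mm_andnot_pd(sign_mask, d1));
    }

    /* Combine the accumulators and spill the two lanes to memory. */
    double lanes[2];
    _mm_storeu_pd(lanes, _mm_add_pd(acc0, acc1));
    double sum = lanes[0] + lanes[1];

    /* Scalar tail for the last n % 4 elements. */
    for (; k < n; ++k) {
        double d = a[k] - b[k];
        sum += (d < 0.0) ? -d : d;
    }
    return sum;
}
```

For the 3072-element arrays in the question the tail loop never runs, so the function reduces to the unrolled loop shown earlier.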

Finally, I don't recommend using OpenMP. The loop is too small, and the overhead of distributing the work among multiple threads might outweigh any potential benefit.

┾廆蒐ゝ 2024-11-05 02:22:21

The maximum of -x and x should be abs(x). Here it is in code:

x = _mm_max_ps(_mm_sub_ps(_mm_setzero_ps(), x), x);
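
For context, here is a hedged sketch of how that one-liner could slot into a single-precision version of the question's loop, processing four floats per vector. The names `abs_max_ps` and `sad_ps`, and the requirement that n be a multiple of 4, are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE */

/* abs(x) as max(-x, x), computed lane-wise on four floats. */
static inline __m128 abs_max_ps(__m128 x) {
    return _mm_max_ps(_mm_sub_ps(_mm_setzero_ps(), x), x);
}

/* Hypothetical name: sum of |a[i] - b[i]| over n floats, n a multiple of 4. */
static float sad_ps(const float *a, const float *b, size_t n) {
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t k = 0; k < n; k += 4) {
        __m128 diff = _mm_sub_ps(_mm_loadu_ps(a + k), _mm_loadu_ps(b + k));
        acc = _mm_add_ps(acc, abs_max_ps(diff));
    }
    /* Spill the four partial sums and add them pairwise. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Compared with the sign-mask approach, this costs a subtraction and a max per vector instead of a single bitwise operation, and accumulating in float gives up some precision for large sums.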

贪恋 2024-11-05 02:22:21

Probably the easiest way is as follows:

__m128d vsum = _mm_set1_pd(0.0);        // init partial sums
for (k = 0; k < 3072; k += 2)
{
    __m128d va = _mm_load_pd(&sima[k]); // load 2 doubles from sima, simb
    __m128d vb = _mm_load_pd(&simb[k]);
    __m128d vdiff = _mm_sub_pd(va, vb); // calc diff = sima - simb
    __m128d vnegdiff = _mm_sub_pd(_mm_set1_pd(0.0), vdiff); // calc neg diff = 0.0 - diff
    __m128d vabsdiff = _mm_max_pd(vdiff, vnegdiff);        // calc abs diff = max(diff, - diff)
    vsum = _mm_add_pd(vsum, vabsdiff);  // accumulate two partial sums
}

Note that this may not be any faster than scalar code on modern x86 CPUs, which typically have two FPUs anyway. However, if you can drop down to single precision, you may well get a 2x throughput improvement.

Note also that you will need to combine the two partial sums in vsum into a scalar value after the loop, but this is fairly trivial to do and is not performance-critical.
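
One possible way to do that final combining step with SSE2 only is sketched below. The helper name `hsum_pd` is an assumption made for this example.

```c
#include <emmintrin.h> /* SSE2 */

/* Add the high and low lanes of v and return the result as a scalar. */
static inline double hsum_pd(__m128d v) {
    __m128d high = _mm_unpackhi_pd(v, v);      /* (v[1], v[1]) */
    return _mm_cvtsd_f64(_mm_add_sd(v, high)); /* v[0] + v[1] */
}
```

After the loop above, `double sum = hsum_pd(vsum);` would give the scalar total, which can then go into the `fp` formula from the question.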
