Fastest way to do horizontal SSE vector sum (or other reduction)

Posted on 2024-11-29 04:36:41


Given a vector of three (or four) floats. What is the fastest way to sum them?

Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add instructions in SSE3 worth it?

What's the cost of moving to the FPU, then faddp, faddp? What's the fastest specific instruction sequence?

"Try to arrange things so you can sum four vectors at a time" will not be accepted as an answer. :-) e.g. for summing an array, you can use multiple vector accumulators for vertical sums (to hide addps latency), and reduce down to one after the loop, but then you need to horizontally sum that last vector.


Comments (5)

听,心雨的声音 2024-12-06 04:36:41


In general for any kind of vector horizontal reduction, extract / shuffle the high half to line up with the low half, then vertical add (or min/max/or/and/xor/multiply/whatever); repeat until there's just a single element (with high garbage in the rest of the vector).

If you start with vectors wider than 128-bit, narrow in half until you get to 128 (then you can use one of the functions in this answer on that vector). But if you need the result broadcast to all elements at the end, then you can consider doing full-width shuffles all the way.
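
As a quick illustration that only the vertical operation changes, here is a sketch of the same shuffle pattern used for a horizontal max instead of a sum (it mirrors the hsum_ps_sse1 function shown further down):

float hmax_ps_sse1(__m128 v) {
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));  // swap pairs
    __m128 maxs = _mm_max_ps(v, shuf);
    shuf        = _mm_movehl_ps(shuf, maxs);                      // high half -> low half
    maxs        = _mm_max_ss(maxs, shuf);
    return _mm_cvtss_f32(maxs);
}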

Related Q&As for wider vectors, and integers, and FP

Integer


Main answer to this question: mostly float and __m128

Here are some versions tuned based on Agner Fog's microarch guide and instruction tables. See also the tag wiki. They should be efficient on any CPU, with no major bottlenecks. (e.g. I avoided things that would help one uarch a bit but be slow on another uarch). Code-size is also minimized.

The common SSE3 / SSSE3 2x hadd idiom is only good for code-size, not speed on any existing CPUs. There are use-cases for it (like transpose and add, see below), but a single vector isn't one of them.

I've also included an AVX version. Any kind of horizontal reduction with AVX / AVX2 should start with a vextractf128 and a "vertical" operation to reduce down to one XMM (__m128) vector. In general for wide vectors, your best bet is to narrow in half repeatedly until you're down to a 128-bit vector, regardless of element type. (Except for 8-bit integer, then vpsadbw as a first step if you want to hsum without overflow to wider elements.)
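
For example, a sketch of that first narrowing step for a __m256i vector of 32-bit integers (assuming AVX2; it reuses the 128-bit hsum_epi32_sse2 helper shown in the integer section near the end of this answer):

int hsum256_epi32_avx2(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);         // low 128 bits, no instruction needed
    __m128i hi = _mm256_extracti128_si256(v, 1);    // vextracti128: high 128 bits
    return hsum_epi32_sse2(_mm_add_epi32(lo, hi));  // vertical add, then the 128-bit hsum
}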

See the asm output from all this code on the Godbolt Compiler Explorer. See also my improvements to Agner Fog's C++ Vector Class Library horizontal_add functions. (message board thread, and code on github). I used CPP macros to select optimal shuffles for code-size for SSE2, SSE4, and AVX, and for avoiding movdqa when AVX isn't available.


There are tradeoffs to consider:

  • code size: smaller is better for L1 I-cache reasons, and for code fetch from disk (smaller binaries). Total binary size mostly matters for compiler decisions made repeatedly all over a program. If you're bothering to hand-code something with intrinsics, it's worth spending a few code bytes if it gives any speedup for the whole program (be careful of microbenchmarks that make unrolling look good).
  • uop-cache size: Often more precious than L1 I$. 4 single-uop instructions can take less space than 2 haddps, so this is highly relevant here.
  • latency: Sometimes relevant
  • throughput (back-end ports): usually irrelevant, horizontal sums shouldn't be the only thing in an innermost loop. Port pressure matters only as part of the whole loop that contains this.
  • throughput (total front-end fused-domain uops): If surrounding code doesn't bottleneck on the same port that the hsum uses, this is a proxy for the impact of the hsum on the throughput of the whole thing.

When a horizontal add is infrequent:

CPUs with no uop-cache might favour 2x haddps if it's very rarely used: It's slowish when it does run, but that's not often. Being only 2 instructions minimizes the impact on the surrounding code (I$ size).

CPUs with a uop-cache will probably favour something that takes fewer uops, even if it's more instructions / more x86 code-size. Total uops cache-lines used is what we want to minimize, which isn't as simple as minimizing total uops (taken branches and 32B boundaries always start a new uop cache line).

Anyway, with that said, horizontal sums come up a lot, so here's my attempt at carefully crafting some versions that compile nicely. Not benchmarked on any real hardware, or even carefully tested. There might be bugs in the shuffle constants or something.


If you're making a fallback / baseline version of your code, remember that only old CPUs will run it; newer CPUs will run your AVX version, or SSE4.1 or whatever.

Old CPUs like K8, and Core2(merom) and earlier only have 64bit shuffle units. Core2 has 128bit execution units for most instructions, but not for shuffles. (Pentium M and K8 handle all 128b vector instructions as two 64bit halves).

Shuffles like movhlps that move data in 64-bit chunks (no shuffling within 64-bit halves) are fast, too.

Related: shuffles on new CPUs, and tricks for avoiding 1/clock shuffle throughput bottleneck on Haswell and later: Do 128bit cross lane operations in AVX512 give better performance?

On old CPUs with slow shuffles:

  • movhlps (Merom: 1uop) is significantly faster than shufps (Merom: 3uops). On Pentium-M, cheaper than movaps. Also, it runs in the FP domain on Core2, avoiding the bypass delays from other shuffles.
  • unpcklpd is faster than unpcklps.
  • pshufd is slow, pshuflw/pshufhw are fast (because they only shuffle a 64bit half)
  • pshufb mm0 (MMX) is fast, pshufb xmm0 is slow.
  • haddps is very slow (6uops on Merom and Pentium M)
  • movshdup (Merom: 1uop) is interesting: It's the only 1uop insn that shuffles within 64b elements.

shufps on Core2(including Penryn) brings data into the integer domain, causing a bypass delay to get it back to the FP execution units for addps, but movhlps is entirely in the FP domain. shufpd also runs in the float domain.

movshdup runs in the integer domain, but is only one uop.

AMD K10, Intel Core2(Penryn/Wolfdale), and all later CPUs, run all xmm shuffles as a single uop. (But note the bypass delay with shufps on Penryn, avoided with movhlps)


Without AVX, avoiding wasted movaps/movdqa instructions requires careful choice of shuffles. Only a few shuffles work as a copy-and-shuffle, rather than modifying the destination. Shuffles that combine data from two inputs (like unpck* or movhlps) can be used with a tmp variable that's no longer needed instead of _mm_movehl_ps(same,same).

Some of these can be made faster (save a MOVAPS) but uglier / less "clean" by taking a dummy arg for use as a destination for an initial shuffle. For example:

// Use dummy = a recently-dead variable that vec depends on,
//  so it doesn't introduce a false dependency,
//  and the compiler probably still has it in a register
__m128d highhalf_pd(__m128d dummy, __m128d vec) {
#ifdef __AVX__
    // With 3-operand AVX instructions, don't create an extra dependency on something we don't need anymore.
    (void)dummy;
    return _mm_unpackhi_pd(vec, vec);
#else
    // Without AVX, we can save a MOVAPS with MOVHLPS into a dead register
    __m128 tmp = _mm_castpd_ps(dummy);
    __m128d high = _mm_castps_pd(_mm_movehl_ps(tmp, _mm_castpd_ps(vec)));
    return high;
#endif
}

__m128 float with SSE1 (aka SSE):

float hsum_ps_sse1(__m128 v) {                                  // v = [ D C | B A ]
    __m128 shuf   = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));  // [ C D | A B ]
    __m128 sums   = _mm_add_ps(v, shuf);      // sums = [ D+C C+D | B+A A+B ]
    shuf          = _mm_movehl_ps(shuf, sums);      //  [   C   D | D+C C+D ]  // let the compiler avoid a mov by reusing shuf
    sums          = _mm_add_ss(sums, shuf);
    return    _mm_cvtss_f32(sums);
}

    # gcc 5.3 -O3:  looks optimal
    movaps  xmm1, xmm0     # I think one movaps is unavoidable, unless we have a 2nd register with known-safe floats in the upper 2 elements
    shufps  xmm1, xmm0, 177
    addps   xmm0, xmm1
    movhlps xmm1, xmm0     # note the reuse of shuf, avoiding a movaps
    addss   xmm0, xmm1

    # clang 3.7.1 -O3:  
    movaps  xmm1, xmm0
    shufps  xmm1, xmm1, 177
    addps   xmm1, xmm0
    movaps  xmm0, xmm1
    shufpd  xmm0, xmm0, 1
    addss   xmm0, xmm1

I reported a clang bug about pessimizing the shuffles. It has its own internal representation for shuffling, and turns that back into shuffles. gcc more often uses the instructions that directly match the intrinsic you used.

Often clang does better than gcc, in code where the instruction choice isn't hand-tuned, or constant-propagation can simplify things even when the intrinsics are optimal for the non-constant case. Overall it's a good thing that compilers work like a proper compiler for intrinsics, not just an assembler. Compilers can often generate good asm from scalar C that doesn't even try to work the way good asm would. Eventually compilers will treat intrinsics as just another C operator as input for the optimizer.


__m128 float with SSE3

float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // broadcast elements 3,1 to 2,0
    __m128 sums = _mm_add_ps(v, shuf);
    shuf        = _mm_movehl_ps(shuf, sums); // high half -> low half
    sums        = _mm_add_ss(sums, shuf);
    return        _mm_cvtss_f32(sums);
}

    # gcc 5.3 -O3: perfectly optimal code
    movshdup    xmm1, xmm0
    addps       xmm0, xmm1
    movhlps     xmm1, xmm0
    addss       xmm0, xmm1

This has several advantages:

  • doesn't require any movaps copies to work around destructive shuffles (without AVX): movshdup xmm1, xmm2's destination is write-only, so it creates tmp out of a dead register for us. This is also why I used movehl_ps(tmp, sums) instead of movehl_ps(sums, sums).

  • small code-size. The shuffling instructions are small: movhlps is 3 bytes, movshdup is 4 bytes (same as shufps). No immediate byte is required, so with AVX, vshufps is 5 bytes but vmovhlps and vmovshdup are both 4.

I could save another byte with addps instead of addss. Since this won't be used inside inner loops, the extra energy to switch the extra transistors is probably negligible. FP exceptions from the upper 3 elements aren't a risk, because all elements hold valid FP data. However, clang/LLVM actually "understands" vector shuffles, and emits better code if it knows that only the low element matters.

Like the SSE1 version, adding the odd elements to themselves may cause FP exceptions (like overflow) that wouldn't happen otherwise, but this shouldn't be a problem. Denormals are slow, but IIRC producing a +Inf result isn't on most uarches.


SSE3 optimizing for code-size

If code-size is your major concern, two haddps (_mm_hadd_ps) instructions will do the trick (Paul R's answer). This is also the easiest to type and remember. It is not fast, though. Even Intel Skylake still decodes each haddps to 3 uops, with 6 cycle latency. So even though it saves machine-code bytes (L1 I-cache), it takes up more space in the more-valuable uop-cache. Real use-cases for haddps: a transpose-and-sum problem, or doing some scaling at an intermediate step in this SSE atoi() implementation.
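For reference, a sketch of that code-size-optimized idiom as a function: it is the same two _mm_hadd_ps calls as in the haddps answer quoted below, just with the scalar extracted at the end.

float hsum_ps_sse3_hadd(__m128 v) {
    v = _mm_hadd_ps(v, v);      // [ D+C B+A | D+C B+A ]
    v = _mm_hadd_ps(v, v);      // all elements = A+B+C+D
    return _mm_cvtss_f32(v);    // 2 instructions, but 6 uops total on e.g. Skylake
}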


__m256 float with AVX:

This version saves a code byte vs. Marat's answer to the AVX question.

#ifdef __AVX__
float hsum256_ps_avx(__m256 v) {
    __m128 vlow  = _mm256_castps256_ps128(v);
    __m128 vhigh = _mm256_extractf128_ps(v, 1); // high 128
           vlow  = _mm_add_ps(vlow, vhigh);     // add the low 128
    return hsum_ps_sse3(vlow);         // and inline the sse3 version, which is optimal for AVX
    // (no wasted instructions, and all of them are the 4B minimum)
}
#endif

 vmovaps xmm1,xmm0               # huh, what the heck gcc?  Just extract to xmm1
 vextractf128 xmm0,ymm0,0x1
 vaddps xmm0,xmm1,xmm0
 vmovshdup xmm1,xmm0
 vaddps xmm0,xmm1,xmm0
 vmovhlps xmm1,xmm1,xmm0
 vaddss xmm0,xmm0,xmm1
 vzeroupper 
 ret

__m128d double Double-precision:

double hsum_pd_sse2(__m128d vd) {                      // v = [ B | A ]
    __m128 undef  = _mm_undefined_ps();                       // don't worry, we only use addSD, never touching the garbage bits with an FP add
    __m128 shuftmp= _mm_movehl_ps(undef, _mm_castpd_ps(vd));  // there is no movhlpd
    __m128d shuf  = _mm_castps_pd(shuftmp);
    return  _mm_cvtsd_f64(_mm_add_sd(vd, shuf));
}

# gcc 5.3.0 -O3
    pxor    xmm1, xmm1          # hopefully when inlined, gcc could pick a register it knew wouldn't cause a false dep problem, and avoid the zeroing
    movhlps xmm1, xmm0
    addsd   xmm0, xmm1


# clang 3.7.1 -O3 again doesn't use movhlps:
    xorpd   xmm2, xmm2          # with  #define _mm_undefined_ps _mm_setzero_ps
    movapd  xmm1, xmm0
    unpckhpd        xmm1, xmm2
    addsd   xmm1, xmm0
    movapd  xmm0, xmm1    # another clang bug: wrong choice of operand order


// This doesn't compile the way it's written
double hsum_pd_scalar_sse2(__m128d vd) {
    double tmp;
    _mm_storeh_pd(&tmp, vd);       // store the high half
    double lo = _mm_cvtsd_f64(vd); // cast the low half
    return lo+tmp;
}

    # gcc 5.3 -O3
    haddpd  xmm0, xmm0   # Lower latency but less throughput than storing to memory

    # ICC13
    movhpd    QWORD PTR [-8+rsp], xmm0    # only needs the store port, not the shuffle unit
    addsd     xmm0, QWORD PTR [-8+rsp]

Storing to memory and back avoids an ALU uop. That's good if shuffle port pressure, or ALU uops in general, are a bottleneck. (Note that it doesn't need to sub rsp, 8 or anything because the x86-64 SysV ABI provides a red-zone that signal handlers won't step on.)

Some people store to an array and sum all the elements, but compilers usually don't realize that the low element of the array is still there in a register from before the store.
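
A sketch of that store-to-an-array approach, for comparison (correct, but compilers typically reload arr[0] from memory instead of reusing the low element that is still in the register):

float hsum_ps_store_array(__m128 v) {
    float arr[4];
    _mm_storeu_ps(arr, v);                      // spill all four elements
    return arr[0] + arr[1] + arr[2] + arr[3];   // scalar adds; serial dependency chain
}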


__m128i int32_t Integer:

pshufd is a convenient copy-and-shuffle. Bit and byte shifts are unfortunately in-place, and punpckhqdq puts the high half of the destination in the low half of the result, opposite of the way movhlps can extract the high half into a different register.

Using movhlps for the first step might be good on some CPUs, but only if we have a scratch reg. pshufd is a safe choice, and fast on everything after Merom.

int hsum_epi32_sse2(__m128i x) {
#ifdef __AVX__
    __m128i hi64  = _mm_unpackhi_epi64(x, x);           // 3-operand non-destructive AVX lets us save a byte without needing a mov
#else
    __m128i hi64  = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
#endif
    __m128i sum64 = _mm_add_epi32(hi64, x);
    __m128i hi32  = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2));    // Swap the low two elements
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);       // SSE2 movd
    //return _mm_extract_epi32(sum32, 0);  // SSE4, even though it compiles to movd instead of a literal pextrd r32,xmm,0
}

    # gcc 5.3 -O3
    pshufd xmm1,xmm0,0x4e
    paddd  xmm0,xmm1
    pshuflw xmm1,xmm0,0x4e
    paddd  xmm0,xmm1
    movd   eax,xmm0

int hsum_epi32_ssse3_slow_smallcode(__m128i x){
    x = _mm_hadd_epi32(x, x);
    x = _mm_hadd_epi32(x, x);
    return _mm_cvtsi128_si32(x);
}

On some CPUs, it's safe to use FP shuffles on integer data. I didn't do this, since on modern CPUs that will at most save 1 or 2 code bytes, with no speed gains (other than code size/alignment effects).
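
For illustration, a sketch of what that FP-shuffle variant might look like (not what I'd recommend, for the reasons above; movhlps runs in the FP domain on some CPUs even though the data is integer):

int hsum_epi32_fp_shuffle(__m128i x) {
    __m128  xf    = _mm_castsi128_ps(x);
    __m128i hi64  = _mm_castps_si128(_mm_movehl_ps(xf, xf));   // FP shuffle on integer data
    __m128i sum64 = _mm_add_epi32(hi64, x);
    __m128i hi32  = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);
}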

埖埖迣鎅 2024-12-06 04:36:41


SSE2

All four:

const __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));
const __m128 sum = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));

r1+r2+r3:

const __m128 t1 = _mm_movehl_ps(v, v);
const __m128 t2 = _mm_add_ps(v, t1);
const __m128 sum = _mm_add_ss(t1, _mm_shuffle_ps(t2, t2, 1));

I've found these to be about same speed as double HADDPS (but I haven't measured too closely).

莫言歌 2024-12-06 04:36:41


You can do it in two HADDPS instructions in SSE3:

v = _mm_hadd_ps(v, v);
v = _mm_hadd_ps(v, v);

This puts the sum in all elements.
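
If a single scalar is needed rather than the broadcast result, the low element can then be extracted:

float sum = _mm_cvtss_f32(v);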

沙沙粒小 2024-12-06 04:36:41


I would definitely give SSE 4.2 a try. If you are doing this multiple times (I assume you are if performance is an issue), you can pre-load a register with (1,1,1,1), and then do several dot4(my_vec(s), one_vec) on it. Yes, it does a superfluous multiply, but those are fairly cheap these days and such an op is likely to be dominated by the horizontal dependencies, which may be more optimized in the new SSE dot product function. You should test to see if it outperforms the double horizontal add Paul R posted.

I also suggest comparing it to straight scalar (or scalar SSE) code - strangely enough it is often faster (usually because internally it is serialized but tightly pipelined using register bypass, where special horizontal instructions may not be fast pathed (yet)) unless you are running SIMT-like code, which it sounds like you are not (otherwise you would do four dot products).
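
A sketch of that idea (the dpps dot-product instruction is SSE4.1; mask 0xF1 multiplies-and-sums all four element pairs and writes the result to element 0):

float hsum_ps_dpps(__m128 v) {
    const __m128 ones = _mm_set1_ps(1.0f);            // pre-load once, ideally outside any loop
    return _mm_cvtss_f32(_mm_dp_ps(v, ones, 0xF1));   // superfluous multiplies by 1.0, as noted above
}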

趁年轻赶紧闹 2024-12-06 04:36:41


Often the question of fastest possible way presupposes a task that needs to be done multiple times, in time critical loop.

Then it's possible, that the fastest method can be an iterative method working pairwise, which amortizes some of the work between iterations.

The total cost of reduction by splitting a vector to low/high parts is O(log2(N)), while the amortised cost by splitting a vector to even/odd sequences is O(1).

inline vec update(vec context, vec data) {
    vec even = get_evens(context, data);
    vec odd = get_odds(context, data);
    return vertical_operation(even, odd);
}

void my_algo(vec *data, int N, vec_element_type *out) {
   vec context{0,0,0,0};
   context = update(context, data[0]);
   for (int i = 0; i < N-1; i++) {
       context = update(context, data[i+1]);
       out[i] = extract_lane(context, 1);
   }
   context = update(context, anything);  // any vector will do; this just folds the last pair-sums
   out[N-1] = extract_lane(context, 1);
}

The wanted sum will be found from the second element (index 1) of the accumulator (after 1 iteration) while the first element will contain the total reduction of all elements so far.

Reduct = [ -- ][ -- ][ -- ][ -- ]
New input = [i0 ][ i1 ][ i2 ][ i3 ]

evens = [ -- ][ -- ][ i0 ][ i2 ]
odds  = [ -- ][ -- ][ i1 ][ i3 ]
-------   vertical arithmetic reduction ----
Reduct = [ -- ][ -- ][ 01 ][ 23 ]


input = [ 4 ][ 5 ][ 6 ][ 7 ]

evens = [ -- ][ 01 ][ 4 ][ 6 ]
odds  = [ -- ][ 23 ][ 5 ][ 7 ]

Reduct = [ -- ][ 0123 ][ 45 ][ 67 ]

New input: [ 8 ] [ 9 ] [ a ] [ b ]
evens = [ -- ][ 45 ][ 8 ][ a ]
odds =  [0123][ 67 ][ 9 ][ b ]
------------------------------
Reduct = [0123][4567][ 89 ][ ab ]
        

I have doubts, if this would prove to be faster for a vector length of 3 or 4 than presented by Mr Cordes, however for 16 or 8 bit data this method should prove to be worthwhile. Then of course one needs to perform 3 or 4 rounds respectively before the result can be acquired.

If the horizontal operation happens to be sum -- then one can actually use just a single hadd per iteration.
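
A sketch of what update() and extract_lane() could look like for __m128 float, with get_evens/get_odds implemented as shufps interleaves; the lane packing here is a hypothetical choice, made so that the per-vector sum lands in lane 1 and lane 0 keeps the running total, as described above:

static inline __m128 update(__m128 context, __m128 data) {
    __m128 evens = _mm_shuffle_ps(context, data, _MM_SHUFFLE(2, 0, 2, 0)); // lanes 0..3 = c0, c2, d0, d2
    __m128 odds  = _mm_shuffle_ps(context, data, _MM_SHUFFLE(3, 1, 3, 1)); // lanes 0..3 = c1, c3, d1, d3
    return _mm_add_ps(evens, odds);   // one vertical add per iteration
}

static inline float extract_lane1(__m128 v) {   // extract_lane(v, 1)
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)));
}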
