Fastest way to do horizontal SSE vector sum (or other reduction)

Posted on 2025-01-26 14:24:07

Given a vector of three (or four) floats. What is the fastest way to sum them?

Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add instructions in SSE3 worth it?

What's the cost to moving to the FPU, then faddp, faddp? What's the fastest specific instruction sequence?

"Try to arrange things so you can sum four vectors at a time" will not be accepted as an answer. :-) e.g. for summing an array, you can use multiple vector accumulators for vertical sums (to hide addps latency), and reduce down to one after the loop, but then you need to horizontally sum that last vector.

Answers (5)

紅太極 2025-02-02 14:24:08

I would definitely give SSE 4.2 a try. If you are doing this multiple times (I assume you are if performance is an issue), you can pre-load a register with (1,1,1,1), and then do several dot4(my_vec(s), one_vec) on it. Yes, it does a superfluous multiply, but those are fairly cheap these days and such an op is likely to be dominated by the horizontal dependencies, which may be more optimized in the new SSE dot product function. You should test to see if it outperforms the double horizontal add Paul R posted.

I also suggest comparing it to straight scalar (or scalar SSE) code - strangely enough it is often faster (usually because internally it is serialized but tightly pipelined using register bypass, where special horizontal instructions may not be fast pathed (yet)) unless you are running SIMT-like code, which it sounds like you are not (otherwise you would do four dot products).
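
For reference, the dot-product idea above looks roughly like this with the SSE4.1 _mm_dp_ps intrinsic (a sketch; the function name is mine):

#include <smmintrin.h>   // SSE4.1

float hsum_ps_dpps(__m128 v) {
    const __m128 ones = _mm_set1_ps(1.0f);
    __m128 s = _mm_dp_ps(v, ones, 0xF1);   // multiply all 4 lanes by 1.0, write the sum to lane 0
    return _mm_cvtss_f32(s);
}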

网白 2025-02-02 14:24:08

Often the question of the fastest possible way presupposes a task that needs to be done multiple times, in a time-critical loop.

Then it's possible that the fastest method is an iterative method working pairwise, which amortizes some of the work between iterations.

The total cost of reduction by splitting a vector to low/high parts is O(log2(N)), while the amortised cost by splitting a vector to even/odd sequences is O(1).

inline vec update(vec context, vec data) {
    vec even = get_evens(context, data);
    vec odd = get_odds(context, data);
    return vertical_operation(even, odd);
}

void my_algo(vec *data, int N, vec_element_type *out) {
   vec context = {0,0,0,0};
   context = update(context, data[0]);
   for (int i = 0; i < N-1; i++) {
       context = update(context, data[i+1]);
       out[i] = extract_lane(context, 1);   // reduction of data[i] is ready one iteration later
   }
   vec anything = context;                  // any value works; only lane 1 of the result is read below
   context = update(context, anything);
   out[N-1] = extract_lane(context, 1);
}

The wanted sum will be found in the second element (index 1) of the accumulator (after one iteration), while the first element will contain the total reduction of all elements so far.

Reduct = [ -- ][ -- ][ -- ][ -- ]
New input = [i0 ][ i1 ][ i2 ][ i3 ]

evens = [ -- ][ -- ][ i0 ][ i2 ]
odds  = [ -- ][ -- ][ i1 ][ i3 ]
-------   vertical arithmetic reduction ----
Reduct = [ -- ][ -- ][ 01 ][ 23 ]


input = [ 4 ][ 5 ][ 6 ][ 7 ]

evens = [ -- ][ 01 ][ 4 ][ 6 ]
odds  = [ -- ][ 23 ][ 5 ][ 7 ]

Reduct = [ -- ][ 0123 ][ 45 ][ 67 ]

New input: [ 8 ] [ 9 ] [ a ] [ b ]
evens = [ -- ][ 45 ][ 8 ][ a ]
odds =  [0123][ 67 ][ 9 ][ b ]
------------------------------
Reduct = [0123][4567][ 89 ][ ab ]
        

I have doubts whether this would prove to be faster for a vector length of 3 or 4 than what Mr Cordes presented; however, for 16- or 8-bit data this method should prove to be worthwhile. Then of course one needs to perform 3 or 4 rounds respectively before the result can be acquired.

If the horizontal operation happens to be sum -- then one can actually use just a single hadd per iteration.
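
A minimal sketch of that (my reading of the idea, not the author's code): for addition, _mm_hadd_ps(context, data) performs exactly the even/odd split followed by the vertical add.

#include <pmmintrin.h>   // SSE3

// update() specialized for a float sum: one hadd per iteration.
// Result lanes: [ c0+c1, c2+c3, d0+d1, d2+d3 ]
static inline __m128 update_sum(__m128 context, __m128 data) {
    return _mm_hadd_ps(context, data);
}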

伏妖词 2025-02-02 14:24:07

In general for any kind of vector horizontal reduction, extract / shuffle high half to line up with low, then vertical add (or min/max/or/and/xor/multiply/whatever); repeat until there's just a single element (with high garbage in the rest of the vector).

If you start with vectors wider than 128-bit, narrow in half until you get to 128 (then you can use one of the functions in this answer on that vector). But if you need the result broadcast to all elements at the end, then you can consider doing full-width shuffles all the way.

Related Q&As for wider vectors, and integers, and FP

Integer

  • __m128i 32-bit elements: this answer (see below). 64-bit elements should be obvious: only one pshufd/paddq step.

  • __m128i 8-bit unsigned uint8_t elements without wrapping/overflow: psadbw against _mm_setzero_si128(), then hsum the two qword halves (or 4 or 8 for wider vectors); see the sketch right after this list. Fastest way to horizontally sum an SSE unsigned byte vector shows the 128-bit SSE2 case.
    Summing 8-bit integers in __m512i with AVX intrinsics has an AVX512 example. How to count character occurrences using SIMD has an AVX2 __m256i example.

    (For int8_t signed bytes you can XOR with set1_epi8(0x80) to flip to unsigned before the SAD, then subtract the bias from the final hsum; see the linked details, which also show an optimization for loading only 9 bytes from memory instead of 16.)

  • 16-bit unsigned: _mm_madd_epi16 with set1_epi16(1) is a single-uop widening horizontal add: SIMD: Accumulate Adjacent Pairs. Then proceed with a 32-bit hsum.

  • __m256i and __m512i with 32-bit elements: Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2. For AVX512, Intel added a bunch of "reduce" inline functions (not hardware instructions) that do this for you, such as _mm512_reduce_add_ps (and pd, epi32, and epi64 versions). Also reduce_min/max/mul/and/or. Doing it manually leads to basically the same asm.

  • Horizontal max (instead of add): Getting max value in a __m128i vector with SSE?
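
As a side note, here is a minimal sketch of that psadbw byte-sum idea (my own illustration, SSE2 only; not code from the linked answers):

#include <emmintrin.h>   // SSE2
#include <stdint.h>

// Sum 16 uint8_t lanes without overflow: psadbw against zero produces two
// 64-bit partial sums (one per 8-byte half), which are then added together.
static inline uint32_t hsum_epu8_sse2(__m128i v) {
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());  // [ hi 8-byte sum | lo 8-byte sum ]
    __m128i hi  = _mm_unpackhi_epi64(sad, sad);          // bring the high qword down
    return (uint32_t)_mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
}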


Main answer to this question: mostly float and __m128

Here are some versions tuned based on Agner Fog's microarch guide and instruction tables. See also the x86 tag wiki. They should be efficient on any CPU, with no major bottlenecks. (e.g. I avoided things that would help one uarch a bit but be slow on another uarch). Code-size is also minimized.

The common SSE3 / SSSE3 2x hadd idiom is only good for code-size, not speed on any existing CPUs. There are use-cases for it (like transpose and add, see below), but a single vector isn't one of them.

I've also included an AVX version. Any kind of horizontal reduction with AVX / AVX2 should start with a vextractf128 and a "vertical" operation to reduce down to one XMM (__m128) vector. In general for wide vectors, your best bet is to narrow in half repeatedly until you're down to a 128-bit vector, regardless of element type. (Except for 8-bit integer, then vpsadbw as a first step if you want to hsum without overflow to wider elements.)

See the asm output from all this code on the Godbolt Compiler Explorer. See also my improvements to Agner Fog's C++ Vector Class Library horizontal_add functions. (message board thread, and code on github). I used CPP macros to select optimal shuffles for code-size for SSE2, SSE4, and AVX, and for avoiding movdqa when AVX isn't available.


There are tradeoffs to consider:

  • code size: smaller is better for L1 I-cache reasons, and for code fetch from disk (smaller binaries). Total binary size mostly matters for compiler decisions made repeatedly all over a program. If you're bothering to hand-code something with intrinsics, it's worth spending a few code bytes if it gives any speedup for the whole program (be careful of microbenchmarks that make unrolling look good).
  • uop-cache size: Often more precious than L1 I$. 4 single-uop instructions can take less space than 2 haddps, so this is highly relevant here.
  • latency: Sometimes relevant
  • throughput (back-end ports): usually irrelevant, horizontal sums shouldn't be the only thing in an innermost loop. Port pressure matters only as part of the whole loop that contains this.
  • throughput (total front-end fused-domain uops): If surrounding code doesn't bottleneck on the same port that the hsum uses, this is a proxy for the impact of the hsum on the throughput of the whole thing.

When a horizontal add is infrequent:

CPUs with no uop-cache might favour 2x haddps if it's very rarely used: It's slowish when it does run, but that's not often. Being only 2 instructions minimizes the impact on the surrounding code (I$ size).

CPUs with a uop-cache will probably favour something that takes fewer uops, even if it's more instructions / more x86 code-size. Total uops cache-lines used is what we want to minimize, which isn't as simple as minimizing total uops (taken branches and 32B boundaries always start a new uop cache line).

Anyway, with that said, horizontal sums come up a lot, so here's my attempt at carefully crafting some versions that compile nicely. Not benchmarked on any real hardware, or even carefully tested. There might be bugs in the shuffle constants or something.


If you're making a fallback / baseline version of your code, remember that only old CPUs will run it; newer CPUs will run your AVX version, or SSE4.1 or whatever.

Old CPUs like K8, and Core2(merom) and earlier only have 64bit shuffle units. Core2 has 128bit execution units for most instructions, but not for shuffles. (Pentium M and K8 handle all 128b vector instructions as two 64bit halves).

Shuffles like movhlps that move data in 64-bit chunks (no shuffling within 64-bit halves) are fast, too.

Related: shuffles on new CPUs, and tricks for avoiding 1/clock shuffle throughput bottleneck on Haswell and later: Do 128bit cross lane operations in AVX512 give better performance?

On old CPUs with slow shuffles:

  • movhlps (Merom: 1uop) is significantly faster than shufps (Merom: 3uops). On Pentium-M, cheaper than movaps. Also, it runs in the FP domain on Core2, avoiding the bypass delays from other shuffles.
  • unpcklpd is faster than unpcklps.
  • pshufd is slow, pshuflw/pshufhw are fast (because they only shuffle a 64bit half)
  • pshufb mm0 (MMX) is fast, pshufb xmm0 is slow.
  • haddps is very slow (6uops on Merom and Pentium M)
  • movshdup (Merom: 1uop) is interesting: It's the only 1uop insn that shuffles within 64b elements.

shufps on Core2(including Penryn) brings data into the integer domain, causing a bypass delay to get it back to the FP execution units for addps, but movhlps is entirely in the FP domain. shufpd also runs in the float domain.

movshdup runs in the integer domain, but is only one uop.

AMD K10, Intel Core2(Penryn/Wolfdale), and all later CPUs, run all xmm shuffles as a single uop. (But note the bypass delay with shufps on Penryn, avoided with movhlps)


Without AVX, avoiding wasted movaps/movdqa instructions requires careful choice of shuffles. Only a few shuffles work as a copy-and-shuffle, rather than modifying the destination. Shuffles that combine data from two inputs (like unpck* or movhlps) can be used with a tmp variable that's no longer needed instead of _mm_movehl_ps(same,same).

Some of these can be made faster (save a MOVAPS) but uglier / less "clean" by taking a dummy arg for use as a destination for an initial shuffle. For example:

// Use dummy = a recently-dead variable that vec depends on,
//  so it doesn't introduce a false dependency,
//  and the compiler probably still has it in a register
__m128d highhalf_pd(__m128d dummy, __m128d vec) {
#ifdef __AVX__
    // With 3-operand AVX instructions, don't create an extra dependency on something we don't need anymore.
    (void)dummy;
    return _mm_unpackhi_pd(vec, vec);
#else
    // Without AVX, we can save a MOVAPS with MOVHLPS into a dead register
    __m128 tmp = _mm_castpd_ps(dummy);
    __m128d high = _mm_castps_pd(_mm_movehl_ps(tmp, _mm_castpd_ps(vec)));
    return high;
#endif
}

__m128 float with SSE1 (aka SSE):

float hsum_ps_sse1(__m128 v) {                                  // v = [ D C | B A ]
    __m128 shuf   = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));  // [ C D | A B ]
    __m128 sums   = _mm_add_ps(v, shuf);      // sums = [ D+C C+D | B+A A+B ]
    shuf          = _mm_movehl_ps(shuf, sums);      //  [   C   D | D+C C+D ]  // let the compiler avoid a mov by reusing shuf
    sums          = _mm_add_ss(sums, shuf);
    return    _mm_cvtss_f32(sums);
}

    # gcc 5.3 -O3:  looks optimal
    movaps  xmm1, xmm0     # I think one movaps is unavoidable, unless we have a 2nd register with known-safe floats in the upper 2 elements
    shufps  xmm1, xmm0, 177
    addps   xmm0, xmm1
    movhlps xmm1, xmm0     # note the reuse of shuf, avoiding a movaps
    addss   xmm0, xmm1

    # clang 3.7.1 -O3:  
    movaps  xmm1, xmm0
    shufps  xmm1, xmm1, 177
    addps   xmm1, xmm0
    movaps  xmm0, xmm1
    shufpd  xmm0, xmm0, 1
    addss   xmm0, xmm1

I reported a clang bug about pessimizing the shuffles. It has its own internal representation for shuffling, and turns that back into shuffles. gcc more often uses the instructions that directly match the intrinsic you used.

Often clang does better than gcc, in code where the instruction choice isn't hand-tuned, or constant-propagation can simplify things even when the intrinsics are optimal for the non-constant case. Overall it's a good thing that compilers work like a proper compiler for intrinsics, not just an assembler. Compilers can often generate good asm from scalar C that doesn't even try to work the way good asm would. Eventually compilers will treat intrinsics as just another C operator as input for the optimizer.


__m128 float with SSE3

float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // broadcast elements 3,1 to 2,0
    __m128 sums = _mm_add_ps(v, shuf);
    shuf        = _mm_movehl_ps(shuf, sums); // high half -> low half
    sums        = _mm_add_ss(sums, shuf);
    return        _mm_cvtss_f32(sums);
}

    # gcc 5.3 -O3: perfectly optimal code
    movshdup    xmm1, xmm0
    addps       xmm0, xmm1
    movhlps     xmm1, xmm0
    addss       xmm0, xmm1

This has several advantages:

  • doesn't require any movaps copies to work around destructive shuffles (without AVX): movshdup xmm1, xmm2's destination is write-only, so it creates tmp out of a dead register for us. This is also why I used movehl_ps(tmp, sums) instead of movehl_ps(sums, sums).

  • small code-size. The shuffling instructions are small: movhlps is 3 bytes, movshdup is 4 bytes (same as shufps). No immediate byte is required, so with AVX, vshufps is 5 bytes but vmovhlps and vmovshdup are both 4.

I could save another byte with addps instead of addss. Since this won't be used inside inner loops, the extra energy to switch the extra transistors is probably negligible. FP exceptions from the upper 3 elements aren't a risk, because all elements hold valid FP data. However, clang/LLVM actually "understands" vector shuffles, and emits better code if it knows that only the low element matters.

Like the SSE1 version, adding the odd elements to themselves may cause FP exceptions (like overflow) that wouldn't happen otherwise, but this shouldn't be a problem. Denormals are slow, but IIRC producing a +Inf result isn't on most uarches.


SSE3 optimizing for code-size

If code-size is your major concern, two haddps (_mm_hadd_ps) instructions will do the trick (Paul R's answer). This is also the easiest to type and remember. It is not fast, though. Even Intel Skylake still decodes each haddps to 3 uops, with 6 cycle latency. So even though it saves machine-code bytes (L1 I-cache), it takes up more space in the more-valuable uop-cache. Real use-cases for haddps: a transpose-and-sum problem, or doing some scaling at an intermediate step in this SSE atoi() implementation.
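
For reference, that two-instruction version wrapped as a function (essentially Paul R's answer at the bottom of this page, plus the scalar extract):

float hsum_ps_sse3_haddps(__m128 v) {
    v = _mm_hadd_ps(v, v);   // [ a+b, c+d, a+b, c+d ]
    v = _mm_hadd_ps(v, v);   // all four lanes now hold a+b+c+d
    return _mm_cvtss_f32(v);
}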


__m256 float with AVX:

This version saves a code byte vs. Marat's answer to the AVX question.

#ifdef __AVX__
float hsum256_ps_avx(__m256 v) {
    __m128 vlow  = _mm256_castps256_ps128(v);
    __m128 vhigh = _mm256_extractf128_ps(v, 1); // high 128
           vlow  = _mm_add_ps(vlow, vhigh);     // add the low 128
    return hsum_ps_sse3(vlow);         // and inline the sse3 version, which is optimal for AVX
    // (no wasted instructions, and all of them are the 4B minimum)
}
#endif

 vmovaps xmm1,xmm0               # huh, what the heck gcc?  Just extract to xmm1
 vextractf128 xmm0,ymm0,0x1
 vaddps xmm0,xmm1,xmm0
 vmovshdup xmm1,xmm0
 vaddps xmm0,xmm1,xmm0
 vmovhlps xmm1,xmm1,xmm0
 vaddss xmm0,xmm0,xmm1
 vzeroupper 
 ret

__m128d Double-precision:

double hsum_pd_sse2(__m128d vd) {                      // v = [ B | A ]
    __m128 undef  = _mm_undefined_ps();                       // don't worry, we only use addSD, never touching the garbage bits with an FP add
    __m128 shuftmp= _mm_movehl_ps(undef, _mm_castpd_ps(vd));  // there is no movhlpd
    __m128d shuf  = _mm_castps_pd(shuftmp);
    return  _mm_cvtsd_f64(_mm_add_sd(vd, shuf));
}

# gcc 5.3.0 -O3
    pxor    xmm1, xmm1          # hopefully when inlined, gcc could pick a register it knew wouldn't cause a false dep problem, and avoid the zeroing
    movhlps xmm1, xmm0
    addsd   xmm0, xmm1


# clang 3.7.1 -O3 again doesn't use movhlps:
    xorpd   xmm2, xmm2          # with  #define _mm_undefined_ps _mm_setzero_ps
    movapd  xmm1, xmm0
    unpckhpd        xmm1, xmm2
    addsd   xmm1, xmm0
    movapd  xmm0, xmm1    # another clang bug: wrong choice of operand order


// This doesn't compile the way it's written
double hsum_pd_scalar_sse2(__m128d vd) {
    double tmp;
    _mm_storeh_pd(&tmp, vd);       // store the high half
    double lo = _mm_cvtsd_f64(vd); // cast the low half
    return lo+tmp;
}

    # gcc 5.3 -O3
    haddpd  xmm0, xmm0   # Lower latency but less throughput than storing to memory

    # ICC13
    movhpd    QWORD PTR [-8+rsp], xmm0    # only needs the store port, not the shuffle unit
    addsd     xmm0, QWORD PTR [-8+rsp]

Storing to memory and back avoids an ALU uop. That's good if shuffle port pressure, or ALU uops in general, are a bottleneck. (Note that it doesn't need to sub rsp, 8 or anything because the x86-64 SysV ABI provides a red-zone that signal handlers won't step on.)

Some people store to an array and sum all the elements, but compilers usually don't realize that the low element of the array is still there in a register from before the store.
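
For illustration, that store-and-sum pattern looks something like this (my sketch, not code from the answer); compilers typically reload all four elements from the stack even though lane 0 is already sitting in a register:

float hsum_ps_store(__m128 v) {
    float tmp[4];
    _mm_storeu_ps(tmp, v);    // spill the whole vector
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}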


__m128i int32_t Integer:

pshufd is a convenient copy-and-shuffle. Bit and byte shifts are unfortunately in-place, and punpckhqdq puts the high half of the destination in the low half of the result, opposite of the way movhlps can extract the high half into a different register.

Using movhlps for the first step might be good on some CPUs, but only if we have a scratch reg. pshufd is a safe choice, and fast on everything after Merom.

int hsum_epi32_sse2(__m128i x) {
#ifdef __AVX__
    __m128i hi64  = _mm_unpackhi_epi64(x, x);           // 3-operand non-destructive AVX lets us save a byte without needing a mov
#else
    __m128i hi64  = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
#endif
    __m128i sum64 = _mm_add_epi32(hi64, x);
    __m128i hi32  = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2));    // Swap the low two elements
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);       // SSE2 movd
    //return _mm_extract_epi32(sum32, 0);     // SSE4, even though it compiles to movd instead of a literal pextrd r32,xmm,0
}

    # gcc 5.3 -O3
    pshufd xmm1,xmm0,0x4e
    paddd  xmm0,xmm1
    pshuflw xmm1,xmm0,0x4e
    paddd  xmm0,xmm1
    movd   eax,xmm0

int hsum_epi32_ssse3_slow_smallcode(__m128i x){
    x = _mm_hadd_epi32(x, x);
    x = _mm_hadd_epi32(x, x);
    return _mm_cvtsi128_si32(x);
}

On some CPUs, it's safe to use FP shuffles on integer data. I didn't do this, since on modern CPUs that will at most save 1 or 2 code bytes, with no speed gains (other than code size/alignment effects).

葬心 2025-02-02 14:24:07

SSE2

All four:

const __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));
const __m128 sum = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));

r1+r2+r3:

const __m128 t1 = _mm_movehl_ps(v, v);
const __m128 t2 = _mm_add_ps(v, t1);
const __m128 sum = _mm_add_ss(t1, _mm_shuffle_ps(t2, t2, 1));

I've found these to be about the same speed as a double HADDPS (but I haven't measured too closely).

时间海 2025-02-02 14:24:07

You can do it in two HADDPS instructions in SSE3:

v = _mm_hadd_ps(v, v);
v = _mm_hadd_ps(v, v);

This puts the sum in all elements.
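
If you need the result as a scalar rather than broadcast to every lane, extract element 0 afterwards:

float sum = _mm_cvtss_f32(v);   // lane 0 holds the (broadcast) total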
