Fastest way to do horizontal SSE vector sum (or other reduction)
Given a vector of three (or four) floats. What is the fastest way to sum them?
Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add instructions in SSE3 worth it?
What's the cost of moving to the FPU, then faddp, faddp? What's the fastest specific instruction sequence?
"Try to arrange things so you can sum four vectors at a time" will not be accepted as an answer. :-) e.g. for summing an array, you can use multiple vector accumulators for vertical sums (to hide addps latency), and reduce down to one after the loop, but then you need to horizontally sum that last vector.
In general for any kind of vector horizontal reduction, extract / shuffle the high half to line up with the low half, then vertical add (or min/max/or/and/xor/multiply/whatever); repeat until there's just a single element (with high garbage in the rest of the vector).
If you start with vectors wider than 128-bit, narrow in half until you get to 128 (then you can use one of the functions in this answer on that vector). But if you need the result broadcast to all elements at the end, then you can consider doing full-width shuffles all the way.
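As a concrete illustration of this halving pattern with a reduction other than a sum, here is a minimal sketch (my own code and function name, assuming AVX2, which implies SSE4.1's _mm_max_epi32) of a horizontal max of eight 32-bit integers:

```c
#include <immintrin.h>

// Horizontal max of 8 x int32_t in a __m256i: narrow 256 -> 128, then keep
// lining the high half up against the low half until one element remains.
static inline int hmax_epi32_avx2(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);       // low 128 bits (no instruction)
    __m128i hi = _mm256_extracti128_si256(v, 1);  // high 128 bits
    __m128i m  = _mm_max_epi32(lo, hi);           // 4 candidates left
    m = _mm_max_epi32(m, _mm_shuffle_epi32(m, _MM_SHUFFLE(1, 0, 3, 2))); // 2 left
    m = _mm_max_epi32(m, _mm_shuffle_epi32(m, _MM_SHUFFLE(2, 3, 0, 1))); // 1 left
    return _mm_cvtsi128_si32(m);                  // movd the low element
}
```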
Related Q&As for wider vectors, and integers, and FP

- __m128 and __m128d: this answer (see below).
- __m256d, with perf analysis for Ryzen 1 vs. Intel (showing why vextractf128 is vastly better than vperm2f128): Get sum of values stored in __m256d with SSE/AVX.
- __m256: How to sum __m256 horizontally? Intel AVX: 256-bits version of dot product for double precision floating point variables of single vectors.
- Dot product of arrays (not just a single vector of 3 or 4 elements): do vertical mul/add or FMA into multiple accumulators, and hsum at the end. Complete AVX+FMA array dot-product example, including an efficient hsum after the loop. (For the simple sum or other reduction of an array, use that pattern but without the multiply part, e.g. add instead of fma.) Do not do the horizontal work separately for each SIMD vector; do it once at the end. (See the sketch after this list.)
- How to count character occurrences using SIMD, as an integer example of counting _mm256_cmpeq_epi8 matches, again over a whole array, only hsumming at the end. (Worth special mention for doing some 8-bit accumulation then widening 8 -> 64-bit to avoid overflow without doing a full hsum at that point.)

Integer

- __m128i 32-bit elements: this answer (see below). 64-bit elements should be obvious: only one pshufd/paddq step.
- __m128i 8-bit unsigned uint8_t elements without wrapping/overflow: psadbw against _mm_setzero_si128(), then hsum the two qword halves (or 4 or 8 for wider vectors). Fastest way to horizontally sum SSE unsigned byte vector shows 128-bit with SSE2. Summing 8-bit integers in __m512i with AVX intrinsics has an AVX512 example. How to count character occurrences using SIMD has an AVX2 __m256i example. (For int8_t signed bytes you can XOR set1_epi8(0x80) to flip to unsigned before SAD, then subtract the bias from the final hsum; see details here, also showing an optimization for doing only 9 bytes from memory instead of 16.)
- 16-bit unsigned: _mm_madd_epi16 with set1_epi16(1) is a single-uop widening horizontal add: SIMD: Accumulate Adjacent Pairs. Then proceed with a 32-bit hsum.
- __m256i and __m512i with 32-bit elements: Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2. For AVX512, Intel added a bunch of "reduce" inline functions (not hardware instructions) that do this for you, like _mm512_reduce_add_ps (and pd, epi32, and epi64). Also reduce_min/max/mul/and/or. Doing it manually leads to basically the same asm.
- Horizontal max (instead of add): Getting max value in a __m128i vector with SSE?
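To make the "multiple accumulators, hsum once at the end" pattern from the dot-product bullet concrete, here is a minimal sketch (my own code and function name, assuming AVX and that n is a multiple of 16):

```c
#include <stddef.h>
#include <immintrin.h>

float sum_array_avx(const float *a, size_t n) {
    // Two (or more) accumulators hide vaddps latency; no horizontal work
    // inside the loop.
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));
    }
    __m256 acc = _mm256_add_ps(acc0, acc1);      // combine accumulators vertically
    // One horizontal sum at the very end: 256 -> 128, then 128-bit hsum.
    __m128 lo = _mm_add_ps(_mm256_castps256_ps128(acc),
                           _mm256_extractf128_ps(acc, 1));
    lo = _mm_add_ps(lo, _mm_movehl_ps(lo, lo));  // add the high pair onto the low pair
    lo = _mm_add_ss(lo, _mm_movehdup_ps(lo));    // add element 1 onto element 0
    return _mm_cvtss_f32(lo);
}
```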
Main answer to this question: mostly float and __m128

Here are some versions tuned based on Agner Fog's microarch guide and instruction tables. See also the x86 tag wiki. They should be efficient on any CPU, with no major bottlenecks. (e.g. I avoided things that would help one uarch a bit but be slow on another uarch). Code-size is also minimized.
The common SSE3 / SSSE3 2x hadd idiom is only good for code-size, not speed on any existing CPUs. There are use-cases for it (like transpose and add, see below), but a single vector isn't one of them.

I've also included an AVX version. Any kind of horizontal reduction with AVX / AVX2 should start with a vextractf128 and a "vertical" operation to reduce down to one XMM (__m128) vector. In general for wide vectors, your best bet is to narrow in half repeatedly until you're down to a 128-bit vector, regardless of element type. (Except for 8-bit integer, then vpsadbw as a first step if you want to hsum without overflow to wider elements.)

See the asm output from all this code on the Godbolt Compiler Explorer. See also my improvements to Agner Fog's C++ Vector Class Library horizontal_add functions. (message board thread, and code on github). I used CPP macros to select optimal shuffles for code-size for SSE2, SSE4, and AVX, and for avoiding movdqa when AVX isn't available.

There are tradeoffs to consider:
- uop-cache size: often more precious than L1 I-cache. 4 single-uop instructions can take less space than 2x haddps, so this is highly relevant here.
- When a horizontal add is infrequent:
  CPUs with no uop-cache might favour 2x haddps if it's very rarely used: it's slowish when it does run, but that's not often. Being only 2 instructions minimizes the impact on the surrounding code (I$ size).
  CPUs with a uop-cache will probably favour something that takes fewer uops, even if it's more instructions / more x86 code-size. Total uop cache-lines used is what we want to minimize, which isn't as simple as minimizing total uops (taken branches and 32B boundaries always start a new uop cache line).
Anyway, with that said, horizontal sums come up a lot, so here's my attempt at carefully crafting some versions that compile nicely. Not benchmarked on any real hardware, or even carefully tested. There might be bugs in the shuffle constants or something.
If you're making a fallback / baseline version of your code, remember that only old CPUs will run it; newer CPUs will run your AVX version, or SSE4.1 or whatever.
Old CPUs like K8 and Core2 (Merom) and earlier only have 64-bit shuffle units. Core2 has 128-bit execution units for most instructions, but not for shuffles. (Pentium M and K8 handle all 128b vector instructions as two 64-bit halves).
Shuffles like movhlps that move data in 64-bit chunks (no shuffling within 64-bit halves) are fast, too.

Related: shuffles on new CPUs, and tricks for avoiding the 1/clock shuffle throughput bottleneck on Haswell and later: Do 128bit cross lane operations in AVX512 give better performance?
On old CPUs with slow shuffles:

- movhlps (Merom: 1 uop) is significantly faster than shufps (Merom: 3 uops). On Pentium-M, cheaper than movaps. Also, it runs in the FP domain on Core2, avoiding the bypass delays from other shuffles.
- unpcklpd is faster than unpcklps.
- pshufd is slow, pshuflw/pshufhw are fast (because they only shuffle a 64-bit half).
- pshufb mm0 (MMX) is fast, pshufb xmm0 is slow.
- haddps is very slow (6 uops on Merom and Pentium M).
- movshdup (Merom: 1 uop) is interesting: it's the only 1-uop insn that shuffles within 64-bit elements.

shufps on Core2 (including Penryn) brings data into the integer domain, causing a bypass delay to get it back to the FP execution units for addps, but movhlps is entirely in the FP domain. shufpd also runs in the float domain. movshdup runs in the integer domain, but is only one uop.

AMD K10, Intel Core2 (Penryn/Wolfdale), and all later CPUs run all xmm shuffles as a single uop. (But note the bypass delay with shufps on Penryn, avoided with movhlps.)

Without AVX, avoiding wasted movaps/movdqa instructions requires careful choice of shuffles. Only a few shuffles work as a copy-and-shuffle rather than modifying the destination. Shuffles that combine data from two inputs (like unpck* or movhlps) can be used with a tmp variable that's no longer needed, instead of _mm_movehl_ps(same,same).

Some of these can be made faster (save a MOVAPS) but uglier / less "clean" by taking a dummy arg for use as a destination for an initial shuffle. For example:
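A hedged sketch of that dummy-argument trick (my own code, not necessarily the exact example originally given): pass a recently-dead variable as dummy, so movhlps can clobber its register instead of forcing a movaps copy.

```c
#include <xmmintrin.h>

// Low half of the result = high half of vec.  The register holding 'dummy'
// is overwritten, so no extra movaps is needed without AVX.  Pass something
// recently computed so this doesn't add a false dependency on stale data.
static inline __m128 highhalf_ps(__m128 dummy, __m128 vec) {
    return _mm_movehl_ps(dummy, vec);
}
```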
__m128 float with SSE1 (aka SSE):
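A minimal sketch of an SSE1-only hsum in the spirit of this section (my own code): swap within pairs with shufps, add, then bring the high half down with movhlps.

```c
#include <xmmintrin.h>

float hsum_ps_sse1(__m128 v) {                                   // v = { a, b, c, d } (a in the low element)
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // { b, a, d, c }
    __m128 sums = _mm_add_ps(v, shuf);                           // { a+b, a+b, c+d, c+d }
    shuf = _mm_movehl_ps(shuf, sums);     // high half of sums -> low half of shuf: low element = c+d
    sums = _mm_add_ss(sums, shuf);        // low element = a+b+c+d
    return _mm_cvtss_f32(sums);
}
```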
I reported a clang bug about pessimizing the shuffles. It has its own internal representation for shuffling, and turns that back into shuffles. gcc more often uses the instructions that directly match the intrinsic you used.
Often clang does better than gcc, in code where the instruction choice isn't hand-tuned, or constant-propagation can simplify things even when the intrinsics are optimal for the non-constant case. Overall it's a good thing that compilers work like a proper compiler for intrinsics, not just an assembler. Compilers can often generate good asm from scalar C that doesn't even try to work the way good asm would. Eventually compilers will treat intrinsics as just another C operator as input for the optimizer.
__m128 float with SSE3
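A sketch of the SSE3 version being discussed (my own code): movshdup as a copy-and-shuffle to create the tmp, then movhlps.

```c
#include <pmmintrin.h>   // SSE3

float hsum_ps_sse3(__m128 v) {                // v = { a, b, c, d } (a in the low element)
    __m128 shuf = _mm_movehdup_ps(v);         // duplicate odd elements: { b, b, d, d }
    __m128 sums = _mm_add_ps(v, shuf);        // { a+b, 2b, c+d, 2d }
    shuf = _mm_movehl_ps(shuf, sums);         // bring c+d down to the low element
    sums = _mm_add_ss(sums, shuf);            // low element = a+b+c+d
    return _mm_cvtss_f32(sums);
}
```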
This has several advantages:

- It doesn't require any movaps copies to work around destructive shuffles (without AVX): movshdup xmm1, xmm2's destination is write-only, so it creates tmp out of a dead register for us. This is also why I used movehl_ps(tmp, sums) instead of movehl_ps(sums, sums).
- Small code-size. The shuffling instructions are small: movhlps is 3 bytes, movshdup is 4 bytes (same as shufps). No immediate byte is required, so with AVX, vshufps is 5 bytes but vmovhlps and vmovshdup are both 4.
- I could save another byte with addps instead of addss. Since this won't be used inside inner loops, the extra energy to switch the extra transistors is probably negligible. FP exceptions from the upper 3 elements aren't a risk, because all elements hold valid FP data. However, clang/LLVM actually "understands" vector shuffles, and emits better code if it knows that only the low element matters.

Like the SSE1 version, adding the odd elements to themselves may cause FP exceptions (like overflow) that wouldn't happen otherwise, but this shouldn't be a problem. Denormals are slow, but IIRC producing a +Inf result isn't on most uarches.
SSE3 optimizing for code-size
If code-size is your major concern, two haddps (_mm_hadd_ps) instructions will do the trick (Paul R's answer). This is also the easiest to type and remember. It is not fast, though. Even Intel Skylake still decodes each haddps to 3 uops, with 6 cycle latency. So even though it saves machine-code bytes (L1 I-cache), it takes up more space in the more-valuable uop-cache. Real use-cases for haddps: a transpose-and-sum problem, or doing some scaling at an intermediate step in this SSE atoi() implementation.

__m256 float with AVX:
This version saves a code byte vs. Marat's answer to the AVX question.
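For reference, a hedged sketch of such an AVX hsum (my own code, so not necessarily byte-for-byte the version being compared): extract the high 128-bit lane, add it to the low lane, then finish with the SSE3-style 128-bit hsum from above.

```c
#include <immintrin.h>

float hsum256_ps_avx(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);     // low lane, no instruction
    __m128 hi = _mm256_extractf128_ps(v, 1);   // vextractf128
    lo = _mm_add_ps(lo, hi);                   // reduce to 128 bits
    __m128 shuf = _mm_movehdup_ps(lo);         // same pattern as the SSE3 hsum
    __m128 sums = _mm_add_ps(lo, shuf);
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```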
__m128d double (double-precision):
Storing to memory and back avoids an ALU uop. That's good if shuffle port pressure, or ALU uops in general, are a bottleneck. (Note that it doesn't need to sub rsp, 8 or anything because the x86-64 SysV ABI provides a red-zone that signal handlers won't step on.)

Some people store to an array and sum all the elements, but compilers usually don't realize that the low element of the array is still there in a register from before the store.
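Two hedged sketches of what this can look like (my own code): a pure-shuffle version, and the store variant the paragraph describes, which keeps the low half in a register and only stores the high half.

```c
#include <emmintrin.h>   // SSE2

double hsum_pd_sse2(__m128d vd) {
    __m128d hi = _mm_unpackhi_pd(vd, vd);       // duplicate the high element
    return _mm_cvtsd_f64(_mm_add_sd(vd, hi));   // low + high
}

double hsum_pd_store(__m128d vd) {
    double hi;
    _mm_storeh_pd(&hi, vd);            // store only the high half: no ALU shuffle uop
    return _mm_cvtsd_f64(vd) + hi;     // the low half is still in the register
}
```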
__m128i int32_t Integer:
pshufd is a convenient copy-and-shuffle. Bit and byte shifts are unfortunately in-place, and punpckhqdq puts the high half of the destination in the low half of the result, opposite of the way movhlps can extract the high half into a different register.

Using movhlps for the first step might be good on some CPUs, but only if we have a scratch reg. pshufd is a safe choice, and fast on everything after Merom.

On some CPUs, it's safe to use FP shuffles on integer data. I didn't do this, since on modern CPUs that will at most save 1 or 2 code bytes, with no speed gains (other than code size/alignment effects).
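A sketch of the 32-bit integer hsum using pshufd as the copy-and-shuffle (my own code, SSE2 only):

```c
#include <emmintrin.h>   // SSE2

int hsum_epi32_sse2(__m128i x) {
    __m128i hi64  = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));     // pshufd: swap 64-bit halves
    __m128i sum64 = _mm_add_epi32(hi64, x);
    __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1)); // swap the low pair
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);                                   // movd the low element
}
```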
SSE2
All four elements, and just the first three (r1+r2+r3):
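Hedged sketches of SSE2-only sequences along these lines (my own code; v holds {v0, v1, v2, v3} with v0 in the low element):

```c
#include <xmmintrin.h>

// All four: v0+v1+v2+v3 in the low element of the result.
static inline __m128 sum4(__m128 v) {
    __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));   // low element = v0+v2, element 1 = v1+v3
    return _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));   // add element 1 onto element 0
}

// First three only: v0+v1+v2 in the low element of the result.
static inline __m128 sum3(__m128 v) {
    __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));   // low element = v0+v2
    return _mm_add_ss(t, _mm_shuffle_ps(v, v, 1));   // add v1 onto it
}
```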
I've found these to be about the same speed as double HADDPS (but I haven't measured too closely).
You can do it in two HADDPS instructions in SSE3. This puts the sum in all elements:
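A minimal sketch of that 2x hadd idiom with intrinsics (my own code):

```c
#include <pmmintrin.h>   // SSE3

__m128 hsum_broadcast_sse3(__m128 v) {       // v = { a, b, c, d }
    v = _mm_hadd_ps(v, v);                   // elements now hold { a+b, c+d, a+b, c+d }
    v = _mm_hadd_ps(v, v);                   // every element = a+b+c+d
    return v;                                // use _mm_cvtss_f32(v) if you want a scalar
}
```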
I would definitely give SSE 4.2 a try. If you are doing this multiple times (I assume you are if performance is an issue), you can pre-load a register with (1,1,1,1), and then do several dot4(my_vec(s), one_vec) on it. Yes, it does a superfluous multiply, but those are fairly cheap these days and such an op is likely to be dominated by the horizontal dependencies, which may be more optimized in the new SSE dot product function. You should test to see if it outperforms the double horizontal add Paul R posted.
I also suggest comparing it to straight scalar (or scalar SSE) code - strangely enough it is often faster (usually because internally it is serialized but tightly pipelined using register bypass, where special horizontal instructions may not be fast pathed (yet)) unless you are running SIMT-like code, which it sounds like you are not (otherwise you would do four dot products).
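The dot-product instruction meant here is DPPS (_mm_dp_ps, introduced with SSE4.1). A hedged sketch of the approach, with a hypothetical helper name of my own:

```c
#include <smmintrin.h>   // SSE4.1

static inline float hsum_dpps(__m128 v) {
    const __m128 ones = _mm_set1_ps(1.0f);            // pre-load this once, outside any loop
    // Immediate 0xF1: multiply all four lanes, write the sum to element 0.
    return _mm_cvtss_f32(_mm_dp_ps(v, ones, 0xF1));
}
```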
Often the question of the fastest possible way presupposes a task that needs to be done multiple times, in a time-critical loop.
Then it's possible that the fastest method can be an iterative method working pairwise, which amortizes some of the work between iterations.
The total cost of reduction by splitting a vector to low/high parts is O(log2(N)), while the amortised cost by splitting a vector to even/odd sequences is O(1).
The wanted sum will be found in the second element (index 1) of the accumulator (after 1 iteration), while the first element will contain the total reduction of all elements so far (see the sketch below).
I have doubts whether this would prove to be faster for a vector length of 3 or 4 than what Mr Cordes presented; however, for 16- or 8-bit data this method should prove to be worthwhile. Then of course one needs to perform 3 or 4 rounds respectively before the result can be acquired.
If the horizontal operation happens to be a sum, then one can actually use just a single hadd per iteration.
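One way to read this scheme (my own interpretation and code, not necessarily the answerer's original version): feed each new vector into the accumulator with a single hadd; the per-vector sum appears in element 1 one iteration later, and element 0 carries the running total of everything fed in up to two iterations earlier.

```c
#include <pmmintrin.h>   // SSE3

// acc starts as _mm_setzero_ps().  After calling feed(acc, v_i) on iteration i:
//   - on iteration i+1, element 1 of acc holds the full sum of v_i
//   - element 0 of acc holds the running sum of all vectors fed up through
//     iteration i-2; feeding two zero vectors after the loop brings it fully
//     up to date.
static inline __m128 feed(__m128 acc, __m128 v) {
    return _mm_hadd_ps(acc, v);   // one hadd per input vector
}
```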