使用 SSE 计算无符号整数之间的绝对差
在 C 中是否有一种无分支技术来计算两个无符号整数之间的绝对差?例如,给定变量 a 和 b,当 a=3、b=5 或 b=3、a=5 时,我想要值 2。理想情况下,我还希望能够使用 SSE 寄存器对计算进行矢量化。
In C is there a branch-less technique to compute the absolute difference between two unsigned ints? For example given the variables a and b, I would like the value 2 for cases when a=3, b=5 or b=3, a=5. Ideally I would also like to be able to vectorize the computation using the SSE registers.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
有多种方法可以实现,我只提一种:
SSE4
PMINUD
和PMAXUD
来分隔寄存器 #1 中较大的值和寄存器中较小的值#2.MMX/SSE2
PCMPGTD
。使用此结果作为掩码。(ab)
和(ba)
的结果POR ( PAND ( mask, ab ), PANDN ( mask, ba ) )
选择正确的绝对差值。There are several ways to do it, I'll just mention one:
SSE4
PMINUD
andPMAXUD
to separate the larger value in register #1, and the smaller value in register #2.MMX/SSE2
PCMPGTD
. Use this result as a mask.(a-b)
and(b-a)
POR ( PAND ( mask, a-b ), PANDN ( mask, b-a ) )
to select the correct value for the absolute difference.来自 tommesani.com,此问题的一种解决方案是使用两次饱和无符号减法。
由于饱和减法永远不会低于 0,因此您可以计算:
r1 = (ab).饱和
r2 = (ba).saturating
如果 a 大于 b,则 r1 将包含答案,而 r2 将为 0,反之亦然(如果 b>a)。
将两个部分结果进行“或”运算将得到所需的结果。
根据VTUNE 用户手册,PSUBUSB/PSUBUSW 可用于 128 位寄存器,因此您应该能够通过这种方式获得大量并行化。
From tommesani.com, one solution for this problem is to use saturating unsigned subtraction twice.
As the saturating subtraction never goes below 0, you compute:
r1 = (a-b).saturating
r2 = (b-a).saturating
If a is greater than b, r1 will contain the answer, and r2 will be 0, and vice-versa for b>a.
ORing the two partial results together will yield the desired result.
According to the VTUNE users manual, PSUBUSB/PSUBUSW is available for 128-bit registers, so you should be able to get a ton of parallelization this way.
你当然可以使用 SSE 寄存器,但编译器可能会为你做这件事
you can certainly use SSE registers, but compiler may do this for you anyways
SSE2:
看起来和 Phernost 的第二个函数的速度差不多。有时 GCC 将其安排得更快一个完整的周期,有时则稍慢一些。
SSSE3:
比以前稍微快一点。根据循环外部的声明方式,存在很多变化。 (例如,使
a
和b
volatile
会使事情变得更快!这似乎是对调度的随机影响。)但这始终是最快的一个周期左右。SSE4(thx rwong):
无法测试这个。
SSE2:
Seems to be about the same speed as Phernost's second function. Sometimes GCC schedules it to be a full cycle faster, other times a little slower.
SSSE3:
Ever so slightly faster than previous. There is a lot of variation depending on how things outside the loop are declared. (For example, making
a
andb
volatile
makes things faster! It appears to be a random effect on scheduling.) But this is consistently fastest by a cycle or so.SSE4 (thx rwong):
Can't test this.
计算差异并返回绝对值
这比使用有符号比较操作需要少一次操作,并且产生更少的寄存器压力。
与以前相同的寄存器压力、多 2 个操作、更好的依赖链分支和合并、微指令解码的指令配对以及单独的单元利用。尽管这需要加载,但可能会超出缓存。这之后我就没有想法了。
在 Core2Duo 上对每个版本进行 200 万次迭代计时后,差异是无法估量的。所以选择更容易理解的。
compute the difference and return the absolute value
This requires one less operation that using the signed compare op, and produces less register pressure.
Same amount of register pressure as before, 2 more ops, better branch and merging of dependency chains, instruction pairing for uops decoding, and separate unit utilization. Although this requires a load, which may be out of cache. I'm out of ideas after this one.
After timing each version with 2 million iterations on a Core2Duo, differences are immeasurable. So pick whatever is easier to understand.
以下一项或多项可能会导致无分支代码,具体取决于机器和编译器,因为条件表达式都非常简单。
我还没有阅读所有 sse 答案,但可能以下一些内容在矢量代码中表示;当然,下面的所有内容都是可矢量化的(如果您一开始就有无符号比较,或者通过首先切换 MSB 来伪造它)。我认为对这个问题提供一些实用的标量答案会很有帮助。
这适用于 x86_64(或 64 位临时值基本上免费的任何东西)
One or more of the below will likely result in branchless code, depending on the machine and compiler, since the conditional expressions are all very simple.
I haven't been through all the sse answers but possibly some of the below are represented in the vector code; certainly all the below are vectorizable (if you have the unsigned compare to begin with, or fake it by toggling the msb first.). I thought it would be helpful to have some practical scalar answers to the question.
This will work on x86_64 (or anything where 64-bit temps are basically free)
试试这个(假设第二个补码,根据您要求 SSE 的事实来判断,这是可以的):
说明:符号位(位 31)向下传播到第一位。 | 的 | 1 部分确保乘数为 1 或 -1。现代 CPU 上的乘法运算速度很快。
Try this (assumes 2nd complements, which is OK judgning by the fact that you're asking for SSE):
Explanation: sign-bit (bit 31) gets propagated down to 1st bit. the | 1 part ensures that the multiplier is either 1 or -1. Multiplications are fast on modern CPUs.
呃……这很简单……
很容易矢量化(使用 SSE3 作为):
a 和 b 是无符号整数。考虑 a=0 和 b=0xffffffff。正确的绝对差是 0xffffffff,但你的解决方案将给出 1。
Erm ... its pretty easy ...
Easily vectorisable (Using SSE3 as):
a and b are unsigned integers. Consider a=0 and b=0xffffffff. The correct absolute difference is 0xffffffff, but your solution will give 1.