Computing the absolute difference between unsigned integers using SSE

Posted 2024-09-12 04:07:59

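In C, is there a branchless technique for computing the absolute difference between two unsigned ints? For example, given the variables a and b, I would like the value 2 both when a=3, b=5 and when b=3, a=5. Ideally I would also like to be able to vectorize the computation using the SSE registers.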

自由如风 2024-09-19 04:07:59

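There are several ways to do this; I'll mention just one:

SSE4

Use PMINUD and PMAXUD to separate the larger values into register #1 and the smaller values into register #2.
Subtract them.

MMX/SSE2

Flip the sign bit of both values, because the next instruction only supports signed integer comparison.
PCMPGTD. Use the result as a mask.
Compute both (a-b) and (b-a).
Use POR ( PAND ( mask, a-b ), PANDN ( mask, b-a ) ) to select the correct absolute difference.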
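
The MMX/SSE2 steps above could be sketched with intrinsics roughly as follows. This is only an illustration under stated assumptions: the helper name absdiff_epu32_sse2 is hypothetical, and the sketch assumes 32-bit lanes in 128-bit registers.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: per-lane |a - b| for four unsigned 32-bit integers. */
static __m128i absdiff_epu32_sse2(__m128i a, __m128i b)
{
    const __m128i sign = _mm_set1_epi32(INT_MIN);   /* 0x80000000 in each lane */

    /* Flip the sign bits so the signed compare orders the values as unsigned. */
    __m128i a_flipped = _mm_xor_si128(a, sign);
    __m128i b_flipped = _mm_xor_si128(b, sign);

    /* PCMPGTD: all ones in every lane where a > b (unsigned), else zero. */
    __m128i mask = _mm_cmpgt_epi32(a_flipped, b_flipped);

    /* Both wrapped differences, computed on the original values. */
    __m128i ab = _mm_sub_epi32(a, b);
    __m128i ba = _mm_sub_epi32(b, a);

    /* POR(PAND(mask, a-b), PANDN(mask, b-a)): keep a-b where a > b, else b-a. */
    return _mm_or_si128(_mm_and_si128(mask, ab), _mm_andnot_si128(mask, ba));
}
```

Note that _mm_andnot_si128 complements its first argument, so the mask goes first.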

脸赞 2024-09-19 04:07:59

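From tommesani.com, one solution to this problem is to use saturating unsigned subtraction twice.

Since a saturating subtraction never goes below 0, you compute:
r1 = (a-b).saturating
r2 = (b-a).saturating

If a is greater than b, r1 will contain the answer and r2 will be 0, and vice versa for b>a.
ORing the two partial results together yields the desired result.

According to the VTUNE user's manual, PSUBUSB/PSUBUSW is available for 128-bit registers, so you should be able to get a lot of parallelism this way.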
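
As a rough sketch of the two-subtraction idea, here is an SSE2 version for 16-bit lanes. The helper name absdiff_epu16_sse2 is hypothetical. PSUBUSB and PSUBUSW operate on 8-bit and 16-bit lanes only, so this applies directly to bytes and words rather than to 32-bit integers.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: per-lane |a - b| for eight unsigned 16-bit integers. */
static __m128i absdiff_epu16_sse2(__m128i a, __m128i b)
{
    __m128i r1 = _mm_subs_epu16(a, b);  /* PSUBUSW: a - b, clamped to 0 where a < b */
    __m128i r2 = _mm_subs_epu16(b, a);  /* PSUBUSW: b - a, clamped to 0 where b < a */
    return _mm_or_si128(r1, r2);        /* at most one of r1, r2 is nonzero per lane */
}
```

The byte version would be the same shape with _mm_subs_epu8 in place of _mm_subs_epu16.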

南街九尾狐 2024-09-19 04:07:59

max(i,j) - min(i,j)
(i>j)*(i-j) + (j>i)*(j-i)

You can certainly use SSE registers, but the compiler may do this for you anyway.
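
Written out as C functions, the two expressions above might look like this. The function names are hypothetical, and whether the compiler emits branch-free code for the conditional expressions depends on the target and optimization level.

```c
#include <stdint.h>

/* max(i,j) - min(i,j): the subtraction never wraps because hi >= lo. */
static uint32_t absdiff_minmax(uint32_t i, uint32_t j)
{
    uint32_t hi = i > j ? i : j;
    uint32_t lo = i > j ? j : i;
    return hi - lo;
}

/* (i>j)*(i-j) + (j>i)*(j-i): each comparison yields 0 or 1, so at most
   one of the two wrapped differences survives the multiplication. */
static uint32_t absdiff_cmpmul(uint32_t i, uint32_t j)
{
    return (uint32_t)(i > j) * (i - j) + (uint32_t)(j > i) * (j - i);
}
```

When i equals j both comparisons are 0, so the second form returns 0 without a special case.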

薄暮涼年 2024-09-19 04:07:59

SSE2:

This seems to be about the same speed as Phernost's second function. Sometimes GCC schedules it to be a full cycle faster, other times a little slower.

__m128i big = _mm_set_epi32( INT_MIN, INT_MIN, INT_MIN, INT_MIN );

a = _mm_add_epi32( a, big ); // re-center the variables: send 0 to INT_MIN,
b = _mm_add_epi32( b, big ); // INT_MAX to -1, etc.
__m128i diff = _mm_sub_epi32( a, b ); // get signed difference
__m128i mask = _mm_cmpgt_epi32( b, a ); // mask: need to negate difference?
mask = _mm_andnot_si128( big, mask ); // mask = 0x7ffff... if negating
diff = _mm_xor_si128( diff, mask ); // 1's complement except MSB
diff = _mm_sub_epi32( diff, mask ); // add 1 and restore MSB

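SSSE3:

Ever so slightly faster than the previous version. There is a lot of variation depending on how things outside the loop are declared. (For example, making a and b volatile makes things faster! It appears to be a random effect on scheduling.) But this is consistently the fastest by a cycle or so.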

__m128i big = _mm_set_epi32( INT_MIN, INT_MIN, INT_MIN, INT_MIN );

a = _mm_add_epi32( a, big ); // re-center the variables: send 0 to INT_MIN,
b = _mm_add_epi32( b, big ); // INT_MAX to -1, etc.
__m128i diff = _mm_sub_epi32( b, a ); // get reverse signed difference
__m128i mask = _mm_cmpgt_epi32( b, a ); // mask: need to negate difference?
mask = _mm_xor_si128( mask, big ); // mask cannot be 0 for PSIGND insn
diff = _mm_sign_epi32( diff, mask ); // negate diff if needed

SSE4 (thanks, rwong):

I can't test this one.

__m128i diff = _mm_sub_epi32( _mm_max_epu32( a, b ), _mm_min_epu32( a, b ) );
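
To show how one of these fragments might be applied to whole arrays, here is a minimal sketch that wraps the SSE2 sequence above in a loop. The function name absdiff_u32_array, the unaligned loads and stores, and the scalar tail loop are assumptions made for illustration; they are not part of the answer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = |a[i] - b[i]| for n unsigned 32-bit values. */
static void absdiff_u32_array(const uint32_t *a, const uint32_t *b,
                              uint32_t *out, size_t n)
{
    const __m128i big = _mm_set1_epi32(INT_MIN);
    size_t i = 0;

    /* Four lanes per iteration, using the SSE2 sequence from the answer. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        va = _mm_add_epi32(va, big);             /* re-center: 0 -> INT_MIN */
        vb = _mm_add_epi32(vb, big);
        __m128i diff = _mm_sub_epi32(va, vb);    /* wrapped a - b */
        __m128i mask = _mm_cmpgt_epi32(vb, va);  /* all ones where b > a */
        mask = _mm_andnot_si128(big, mask);      /* 0x7fffffff where negating */
        diff = _mm_xor_si128(diff, mask);
        diff = _mm_sub_epi32(diff, mask);        /* completes the negation */
        _mm_storeu_si128((__m128i *)(out + i), diff);
    }

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i)
        out[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
}
```

If the arrays are known to be 16-byte aligned, _mm_load_si128 and _mm_store_si128 could replace the unaligned variants.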

柠檬色的秋千 2024-09-19 04:07:59

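Compute the difference, then take its absolute value:

__m128i diff = _mm_sub_epi32(a, b);     // wrapped difference a - b
__m128i mask = _mm_xor_si128(diff, a);
mask = _mm_xor_si128(mask, b);          // bit 31 of diff^a^b is the borrow into bit 31
mask = _mm_srai_epi32(mask, 31);        // all ones where that borrow is set
diff = _mm_xor_si128(diff, mask);       // one's complement where the mask is set
mask = _mm_srli_epi32(mask, 31);        // 1 where the mask is set
diff = _mm_add_epi32(diff, mask);       // add 1 to finish the negation

Bit 31 of diff^a^b is the borrow into bit 31, not the borrow out of it, so this version is correct only when a and b agree in bit 31 (for example, when both are below 2^31). With a=0x40000000 and b=0x80000000 it returns 0xC0000000 instead of 0x40000000.

This requires one less operation than using the signed compare op, and produces less register pressure.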

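The next version has the same register pressure as before and 2 more ops, but better branching and merging of dependency chains, instruction pairing for uop decoding, and use of separate execution units. It does require a load for the constant, which may miss the cache. I'm out of ideas after this one.

__m128i mask, diff;
diff = _mm_set1_epi32(INT_MIN); // sign-bit constant (needs <limits.h>); dependency branch after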
a = _mm_add_epi32(a, diff); // arithmetic sign flip
b = _mm_xor_si128(b, diff); // bitwise sign flip parallel with 'add' unit
diff = _mm_xor_si128(a, b); // reduce uops, instruction already decoded
mask = _mm_cmpgt_epi32(b, a); // parallel with xor
mask = _mm_and_si128(mask, diff); // dependency merge, branch after
a = _mm_xor_si128(a, mask); // if 2 'bit' units in CPU, parallel with next
b = _mm_xor_si128(b, mask); // reduce uops, instruction already decoded
diff = _mm_sub_epi32(a, b); // result

After timing each version with 2 million iterations on a Core2Duo, the differences are too small to measure. So pick whichever is easier to understand.

最冷一天 2024-09-19 04:07:59

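One or more of the functions below will likely compile to branchless code, depending on the machine and compiler, since the conditional expressions are all very simple.

I haven't been through all the SSE answers, but some of the below may already be represented in the vector code; certainly all of them are vectorizable (if you have an unsigned compare to begin with, or fake it by toggling the MSB first). I thought it would be helpful to have some practical scalar answers to the question.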

unsigned udiff( unsigned a, unsigned b )
{
      unsigned result = a-b;   // ok if a<b;
      if(a <b ) result = -result; 
      return result;
}
unsigned udiff( unsigned a, unsigned b )
{
      unsigned n =(a<b)? (unsigned)-1 : 0u;
      unsigned result = a-b;
      return (result^n)-n; // 'result' if n = 0; '-result' if n = 0xFFFFFFFF
}


unsigned udiff( unsigned a, unsigned b )
{
      unsigned axb = a^b;
      if( a < b )  axb = 0;
      return (axb^b) - (axb^a);  // a-b, or b-a
}

This will work on x86_64 (or anything where 64-bit temporaries are basically free):

unsigned udiff( unsigned a, unsigned b )
{
      unsigned n= (unsigned)( 
         (long long)((unsigned long long)a - (unsigned long long)b)>>32 
                      ); // same n as 2nd example
      unsigned result = a-b;
      return (result^n)-n; // 'result' if n = 0; '-result' if n = 0xFFFFFFFF
}
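
One way to gain confidence in any of these scalar variants is to compare it against a widened reference on edge-case inputs. The sketch below is an assumption-laden illustration: udiff_ref, udiff_matches_ref, and udiff_xor are hypothetical names, and udiff_xor restates the second variant above with fixed-width types.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference: widen to 64 bits so the subtraction cannot wrap. */
static uint32_t udiff_ref(uint32_t a, uint32_t b)
{
    uint64_t wa = a, wb = b;
    return (uint32_t)(wa > wb ? wa - wb : wb - wa);
}

/* The (result ^ n) - n variant, restated with uint32_t. */
static uint32_t udiff_xor(uint32_t a, uint32_t b)
{
    uint32_t n = (a < b) ? 0xFFFFFFFFu : 0u;
    uint32_t result = a - b;
    return (result ^ n) - n;  /* result if n == 0, -result if n == 0xFFFFFFFF */
}

/* Returns 1 if candidate agrees with udiff_ref on every pair of edge values. */
static int udiff_matches_ref(uint32_t (*candidate)(uint32_t, uint32_t))
{
    static const uint32_t edge[] = {
        0u, 1u, 3u, 5u, 0x40000000u, 0x7FFFFFFFu,
        0x80000000u, 0x80000001u, 0xFFFFFFFFu
    };
    const size_t n = sizeof edge / sizeof edge[0];

    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            if (candidate(edge[i], edge[j]) != udiff_ref(edge[i], edge[j]))
                return 0;
    return 1;
}
```

Values on either side of 0x80000000 are included because that is where approaches based on signed arithmetic tend to go wrong.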

绝影如岚 2024-09-19 04:07:59

Try this (it assumes two's complement, which is fine judging by the fact that you're asking about SSE):

int d = a-b;
int ad = ((d >> 30) | 1) * d;

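Explanation: the sign bit (bit 31) is propagated down to bit 1 by the shift. The | 1 part ensures that the multiplier is either 1 or -1. Multiplications are fast on modern CPUs.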

诗化ㄋ丶相逢 2024-09-19 04:07:59

Erm ... it's pretty easy ...

int diff = abs( a - b );

Easily vectorisable (using SSSE3):

__m128i sseDiff = _mm_abs_epi32( _mm_sub_epi32( a, b ) );

a and b are unsigned integers. Consider a=0 and b=0xffffffff. The correct absolute difference is 0xffffffff, but your solution will give 1.
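
The counterexample can be checked with a small scalar sketch. The function names are hypothetical, and the conversion of the wrapped difference to a signed type assumes the usual two's complement behavior (it is implementation-defined in C for values above INT32_MAX).

```c
#include <stdint.h>

/* The signed approach: treat the wrapped difference as signed, then take
   its magnitude. The negation is done in unsigned arithmetic to avoid
   overflow when d is INT32_MIN. */
static uint32_t absdiff_via_signed(uint32_t a, uint32_t b)
{
    int32_t d = (int32_t)(a - b);  /* assumes two's complement conversion */
    return d < 0 ? 0u - (uint32_t)d : (uint32_t)d;
}

/* The unsigned absolute difference that the question asks for. */
static uint32_t absdiff_unsigned(uint32_t a, uint32_t b)
{
    return a > b ? a - b : b - a;
}
```

For a=0 and b=0xffffffff the wrapped difference a - b is 1, so the signed approach returns 1 while the true distance is 0xffffffff. The two functions agree whenever the true difference is at most 0x80000000.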

About the author: 苏璃陌