如何在C中进行无符号饱和加法?

发布于 2024-07-06 03:12:54 字数 189 浏览 10 评论 0原文

在 C 中编写饱和加法的最佳(最干净、最有效)方法是什么?

函数或宏应添加两个无符号输入(需要 16 位和 32 位版本),并在总和溢出时返回全位 1(0xFFFF 或 0xFFFFFFFF)。

目标是使用 gcc (4.1.2) 和 Visual Studio 的 x86 和 ARM(仅用于模拟,因此可以使用回退实现)。

What is the best (cleanest, most efficient) way to write saturating addition in C?

The function or macro should add two unsigned inputs (need both 16- and 32-bit versions) and return all-bits-one (0xFFFF or 0xFFFFFFFF) if the sum overflows.

Target is x86 and ARM using gcc (4.1.2) and Visual Studio (for simulation only, so a fallback implementation is OK there).

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(19

顾忌 2024-07-13 03:12:54

C++ 模板,防止将其与有符号类型一起使用,并且没有 -1 转换:

template<typename T, typename = std::enable_if_t<std::is_unsigned_v<T>>>
T saturatingAdd(T a, T b)
{
  T c = a + b;
  return c < a ? std::numeric_limits<T>::MAX : c;
}

C++ template, with protection against using it with signed types, and no -1 casting:

template<typename T, typename = std::enable_if_t<std::is_unsigned_v<T>>>
T saturatingAdd(T a, T b)
{
  T c = a + b;
  return c < a ? std::numeric_limits<T>::MAX : c;
}
谎言月老 2024-07-13 03:12:54

您可能需要此处的可移植 C 代码,您的编译器会将其转换为正确的 ARM 程序集。 ARM 有条件移动,并且这些移动可以以溢出为条件。 然后,该算法变为:如果检测到溢出,则添加并有条件地将目标设置为无符号 (-1)。

uint16_t add16(uint16_t a, uint16_t b)
{
  uint16_t c = a + b;
  if (c < a)  /* Can only happen due to overflow */
    c = -1;
  return c;
}

请注意,这与其他算法的不同之处在于它纠正溢出,而不是依赖其他计算来检测溢出。

x86-64 clang 3.7 -O3 输出add32:明显优于任何其他答案:

add     edi, esi
mov     eax, -1
cmovae  eax, edi
ret

ARMv7:adds32 的 gcc 4.8 -O3 -mcpu=cortex-a15 -fverbose-asm 输出

adds    r0, r0, r1      @ c, a, b
it      cs
movcs   r0, #-1         @ conditional-move
bx      lr

16位:仍然不使用 ARM 的无符号- 饱和加法指令 (UADD16)

add     r1, r1, r0        @ tmp114, a
movw    r3, #65535      @ tmp116,
uxth    r1, r1  @ c, tmp114
cmp     r0, r1    @ a, c
ite     ls        @
movls   r0, r1        @,, c
movhi   r0, r3        @,, tmp116
bx      lr  @

You probably want portable C code here, which your compiler will turn into proper ARM assembly. ARM has conditional moves, and these can be conditional on overflow. The algorithm then becomes: add and conditionally set the destination to unsigned(-1), if overflow was detected.

uint16_t add16(uint16_t a, uint16_t b)
{
  uint16_t c = a + b;
  if (c < a)  /* Can only happen due to overflow */
    c = -1;
  return c;
}

Note that this differs from the other algorithms in that it corrects overflow, instead of relying on another calculation to detect overflow.

x86-64 clang 3.7 -O3 output for adds32: significantly better than any other answer:

add     edi, esi
mov     eax, -1
cmovae  eax, edi
ret

ARMv7: gcc 4.8 -O3 -mcpu=cortex-a15 -fverbose-asm output for adds32:

adds    r0, r0, r1      @ c, a, b
it      cs
movcs   r0, #-1         @ conditional-move
bx      lr

16bit: still doesn't use ARM's unsigned-saturating add instruction (UADD16)

add     r1, r1, r0        @ tmp114, a
movw    r3, #65535      @ tmp116,
uxth    r1, r1  @ c, tmp114
cmp     r0, r1    @ a, c
ite     ls        @
movls   r0, r1        @,, c
movhi   r0, r3        @,, tmp116
bx      lr  @
夜夜流光相皎洁 2024-07-13 03:12:54

在没有条件跳转的 IA32 中:

uint32_t sadd32(uint32_t a, uint32_t b)
{
#if defined IA32
  __asm
  {
    mov eax,a
    xor edx,edx
    add eax,b
    setnc dl
    dec edx
    or eax,edx
  }
#elif defined ARM
  // ARM code
#else
  // non-IA32/ARM way, copy from above
#endif
}

In IA32 without conditional jumps:

uint32_t sadd32(uint32_t a, uint32_t b)
{
#if defined IA32
  __asm
  {
    mov eax,a
    xor edx,edx
    add eax,b
    setnc dl
    dec edx
    or eax,edx
  }
#elif defined ARM
  // ARM code
#else
  // non-IA32/ARM way, copy from above
#endif
}
没有心的人 2024-07-13 03:12:54

在普通的C中:

uint16_t sadd16(uint16_t a, uint16_t b) {
  return (a > 0xFFFF - b) ? 0xFFFF : a + b;
}
     
uint32_t sadd32(uint32_t a, uint32_t b) {
  return (a > 0xFFFFFFFF - b) ? 0xFFFFFFFF : a + b;
}

这几乎是宏观化的,直接传达了含义。

In plain C:

uint16_t sadd16(uint16_t a, uint16_t b) {
  return (a > 0xFFFF - b) ? 0xFFFF : a + b;
}
     
uint32_t sadd32(uint32_t a, uint32_t b) {
  return (a > 0xFFFFFFFF - b) ? 0xFFFFFFFF : a + b;
}

which is almost macro-ized and directly conveys the meaning.

柏林苍穹下 2024-07-13 03:12:54
int saturating_add(int x, int y)
{
    int w = sizeof(int) << 3;
    int msb = 1 << (w-1);

    int s = x + y;
    int sign_x = msb & x;
    int sign_y = msb & y;
    int sign_s = msb & s;

    int nflow = sign_x && sign_y && !sign_s;
    int pflow = !sign_x && !sign_y && sign_s;

    int nmask = (~!nflow + 1);
    int pmask = (~!pflow + 1);

    return (nmask & ((pmask & s) | (~pmask & ~msb))) | (~nmask & msb);
}

此实现不使用控制流、campare 运算符(==, !=) 和?: 运算符。 它只使用按位运算符和逻辑运算符。

int saturating_add(int x, int y)
{
    int w = sizeof(int) << 3;
    int msb = 1 << (w-1);

    int s = x + y;
    int sign_x = msb & x;
    int sign_y = msb & y;
    int sign_s = msb & s;

    int nflow = sign_x && sign_y && !sign_s;
    int pflow = !sign_x && !sign_y && sign_s;

    int nmask = (~!nflow + 1);
    int pmask = (~!pflow + 1);

    return (nmask & ((pmask & s) | (~pmask & ~msb))) | (~nmask & msb);
}

This implementation doesn't use control flows, campare operators(==, !=) and the ?: operator. It just uses bitwise operators and logical operators.

能怎样 2024-07-13 03:12:54

无分支 x86 asm 解决方案的替代方案是(AT&T 语法,eax 和 ebx 中的 a 和 b,结果为 eax):

add %eax,%ebx
sbb $0,%ebx

An alternative to the branch free x86 asm solution is (AT&T syntax, a and b in eax and ebx, result in eax):

add %eax,%ebx
sbb $0,%ebx
最初的梦 2024-07-13 03:12:54

如果您关心性能,您确实希望在 SIMD 中执行此类操作,其中 x86 具有本机饱和算术。

由于标量数学中缺乏饱和算术,人们可能会遇到这样的情况:在 4 变量宽 SIMD 上完成的运算比等效的 C 快超过 4 倍(对于 8- 变量也是如此)变量范围 SIMD):

sub8x8_dct8_c: 1332 clocks
sub8x8_dct8_mmx: 182 clocks
sub8x8_dct8_sse2: 127 clocks

If you care about performance, you really want to do this sort of stuff in SIMD, where x86 has native saturating arithmetic.

Because of this lack of saturating arithmetic in scalar math, one can get cases in which operations done on 4-variable-wide SIMD is more than 4 times faster than the equivalent C (and correspondingly true with 8-variable-wide SIMD):

sub8x8_dct8_c: 1332 clocks
sub8x8_dct8_mmx: 182 clocks
sub8x8_dct8_sse2: 127 clocks
油焖大侠 2024-07-13 03:12:54

零分支解决方案:

uint32_t sadd32(uint32_t a, uint32_t b)
{
    uint64_t s = (uint64_t)a+b;
    return -(s>>32) | (uint32_t)s;
}

一个好的编译器会对此进行优化,以避免执行任何实际的 64 位算术(s>>32 只是进位标志,而 -(s>>32 )sbb %eax,%eax) 的结果。

在 x86 asm 中(AT&T 语法,eaxebx 中的 ab,结果为 eax):

add %eax,%ebx
sbb %eax,%eax
or %ebx,%eax

8 位和 16 位版本应该是显而易见的。 签名版本可能需要更多工作。

Zero branch solution:

uint32_t sadd32(uint32_t a, uint32_t b)
{
    uint64_t s = (uint64_t)a+b;
    return -(s>>32) | (uint32_t)s;
}

A good compiler will optimize this to avoid doing any actual 64-bit arithmetic (s>>32 will merely be the carry flag, and -(s>>32) is the result of sbb %eax,%eax).

In x86 asm (AT&T syntax, a and b in eax and ebx, result in eax):

add %eax,%ebx
sbb %eax,%eax
or %ebx,%eax

8- and 16-bit versions should be obvious. Signed version might require a bit more work.

格子衫的從容 2024-07-13 03:12:54
uint32_t saturate_add32(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;
    if ((sum < a) || (sum < b))
        return ~((uint32_t)0);
    else
        return sum;
} /* saturate_add32 */

uint16_t saturate_add16(uint16_t a, uint16_t b)
{
    uint16_t sum = a + b;
    if ((sum < a) || (sum < b))
        return ~((uint16_t)0);
    else
        return sum;
} /* saturate_add16 */

编辑:现在您已经发布了您的版本,我不确定我的版本是否更干净/更好/更高效/更可靠。

uint32_t saturate_add32(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;
    if ((sum < a) || (sum < b))
        return ~((uint32_t)0);
    else
        return sum;
} /* saturate_add32 */

uint16_t saturate_add16(uint16_t a, uint16_t b)
{
    uint16_t sum = a + b;
    if ((sum < a) || (sum < b))
        return ~((uint16_t)0);
    else
        return sum;
} /* saturate_add16 */

Edit: Now that you've posted your version, I'm not sure mine is any cleaner/better/more efficient/more studly.

流心雨 2024-07-13 03:12:54

我们当前使用的实现是:

#define sadd16(a, b)  (uint16_t)( ((uint32_t)(a)+(uint32_t)(b)) > 0xffff ? 0xffff : ((a)+(b)))
#define sadd32(a, b)  (uint32_t)( ((uint64_t)(a)+(uint64_t)(b)) > 0xffffffff ? 0xffffffff : ((a)+(b)))

The current implementation we are using is:

#define sadd16(a, b)  (uint16_t)( ((uint32_t)(a)+(uint32_t)(b)) > 0xffff ? 0xffff : ((a)+(b)))
#define sadd32(a, b)  (uint32_t)( ((uint64_t)(a)+(uint64_t)(b)) > 0xffffffff ? 0xffffffff : ((a)+(b)))
顾挽 2024-07-13 03:12:54

我不确定这是否比 Skizz 的解决方案(始终进行配置文件)更快,但这里有一个替代的无分支装配解决方案。 请注意,这需要条件移动(CMOV)指令,我不确定您的目标是否可用。


uint32_t sadd32(uint32_t a, uint32_t b)
{
    __asm
    {
        movl eax, a
        addl eax, b
        movl edx, 0xffffffff
        cmovc eax, edx
    }
}

I'm not sure if this is faster than Skizz's solution (always profile), but here's an alternative no-branch assembly solution. Note that this requires the conditional move (CMOV) instruction, which I'm not sure is available on your target.


uint32_t sadd32(uint32_t a, uint32_t b)
{
    __asm
    {
        movl eax, a
        addl eax, b
        movl edx, 0xffffffff
        cmovc eax, edx
    }
}
余生共白头 2024-07-13 03:12:54

我想,x86 的最佳方法是使用内联汇编器在添加后检查溢出标志。 比如:

add eax, ebx
jno @@1
or eax, 0FFFFFFFFh
@@1:
.......

它不是很便携,但恕我直言,这是最有效的方法。

I suppose, the best way for x86 is to use inline assembler to check overflow flag after addition. Something like:

add eax, ebx
jno @@1
or eax, 0FFFFFFFFh
@@1:
.......

It's not very portable, but IMHO the most efficient way.

独孤求败 2024-07-13 03:12:54

以防万一有人想知道不使用 2 的补码 32 位整数进行分支的实现。

警告! 此代码使用未定义的操作:“右移 -1”,因此利用了 Intel Pentium SAL 指令 将计数操作数屏蔽为 5 位。

int32_t sadd(int32_t a, int32_t b){
    int32_t sum = a+b;
    int32_t overflow = ((a^sum)&(b^sum))>>31;
    return (overflow<<31)^(sum>>overflow);
 }

这是我所知道的最好的实现

Just in case someone wants to know an implementation without branching using 2's complement 32bit integers.

Warning! This code uses the undefined operation: "shift right by -1" and therefore exploits the property of the Intel Pentium SAL instruction to mask the count operand to 5 bits.

int32_t sadd(int32_t a, int32_t b){
    int32_t sum = a+b;
    int32_t overflow = ((a^sum)&(b^sum))>>31;
    return (overflow<<31)^(sum>>overflow);
 }

It's the best implementation known to me

时光倒影 2024-07-13 03:12:54

最佳性能通常涉及内联汇编(正如一些人已经指出的那样)。

但对于可移植 C 来说,这些函数只涉及一次比较,没有类型转换(因此我认为是最佳的):

unsigned saturate_add_uint(unsigned x, unsigned y)
{
    if (y > UINT_MAX - x) return UINT_MAX;
    return x + y;
}

unsigned short saturate_add_ushort(unsigned short x, unsigned short y)
{
    if (y > USHRT_MAX - x) return USHRT_MAX;
    return x + y;
}

作为宏,它们变成:

SATURATE_ADD_UINT(x, y) (((y)>UINT_MAX-(x)) ? UINT_MAX : ((x)+(y)))
SATURATE_ADD_USHORT(x, y) (((y)>SHRT_MAX-(x)) ? USHRT_MAX : ((x)+(y)))

我将“unsigned long”和“unsigned long long”的版本留给读者作为练习。 ;-)

The best performance will usually involve inline assembly (as some have already stated).

But for portable C, these functions only involve one comparison and no type-casting (and thus I believe optimal):

unsigned saturate_add_uint(unsigned x, unsigned y)
{
    if (y > UINT_MAX - x) return UINT_MAX;
    return x + y;
}

unsigned short saturate_add_ushort(unsigned short x, unsigned short y)
{
    if (y > USHRT_MAX - x) return USHRT_MAX;
    return x + y;
}

As macros, they become:

SATURATE_ADD_UINT(x, y) (((y)>UINT_MAX-(x)) ? UINT_MAX : ((x)+(y)))
SATURATE_ADD_USHORT(x, y) (((y)>SHRT_MAX-(x)) ? USHRT_MAX : ((x)+(y)))

I leave versions for 'unsigned long' and 'unsigned long long' as an exercise to the reader. ;-)

遗心遗梦遗幸福 2024-07-13 03:12:54

在 ARM 中,您可能已经内置了饱和算术。 ARMv5 DSP 扩展可以使寄存器饱和到任何位长度。 另外,ARM 饱和度通常很便宜,因为您可以有条件地执行大多数指令。

ARMv6 甚至具有 32 位和压缩数字的饱和加法、减法和所有其他内容。

在 x86 上,您可以通过 MMX 或 SSE 获得饱和算术。

所有这些都需要汇编程序,所以这不是您所要求的。

还有一些 C 技巧可以进行饱和算术。 这段小代码对双字的四个字节进行饱和加法。 它基于并行计算 32 个半加器的思想,例如在没有进位溢出的情况下将数字相加。

这是首先完成的。 然后计算进位,相加,如果加法溢出则用掩码替换。

uint32_t SatAddUnsigned8(uint32_t x, uint32_t y) 
{
  uint32_t signmask = 0x80808080;
  uint32_t t0 = (y ^ x) & signmask;
  uint32_t t1 = (y & x) & signmask;
  x &= ~signmask;
  y &= ~signmask;
  x += y;
  t1 |= t0 & x;
  t1 = (t1 << 1) - (t1 >> 7);
  return (x ^ t0) | t1;
}

您可以通过更改符号掩码常量和底部的移位来获得 16 位(或任何类型的位字段)相同的结果,如下所示:

uint32_t SatAddUnsigned16(uint32_t x, uint32_t y) 
{
  uint32_t signmask = 0x80008000;
  uint32_t t0 = (y ^ x) & signmask;
  uint32_t t1 = (y & x) & signmask;
  x &= ~signmask;
  y &= ~signmask;
  x += y;
  t1 |= t0 & x;
  t1 = (t1 << 1) - (t1 >> 15);
  return (x ^ t0) | t1;
}

uint32_t SatAddUnsigned32 (uint32_t x, uint32_t y)
{
  uint32_t signmask = 0x80000000;
  uint32_t t0 = (y ^ x) & signmask;
  uint32_t t1 = (y & x) & signmask;
  x &= ~signmask;
  y &= ~signmask;
  x += y;
  t1 |= t0 & x;
  t1 = (t1 << 1) - (t1 >> 31);
  return (x ^ t0) | t1;
}

上面的代码对 16 位和 32 位值执行相同的操作。

如果您不需要函数并行添加和饱和多个值的功能,只需屏蔽您需要的位即可。 在 ARM 上,您还需要更改符号掩码常量,因为 ARM 无法在单个周期内加载所有可能的 32 位常量。

编辑:并行版本很可能比直接方法慢,但如果您必须一次饱和多个值,它们会更快。

In ARM you may already have saturated arithmetic built-in. The ARMv5 DSP-extensions can saturate registers to any bit-length. Also on ARM saturation is usually cheap because you can excute most instructions conditional.

ARMv6 even has saturated addition, subtraction and all the other stuff for 32 bits and packed numbers.

On the x86 you get saturated arithmetic either via MMX or SSE.

All this needs assembler, so it's not what you've asked for.

There are C-tricks to do saturated arithmetic as well. This little code does saturated addition on four bytes of a dword. It's based on the idea to calculate 32 half-adders in parallel, e.g. adding numbers without carry overflow.

This is done first. Then the carries are calculated, added and replaced with a mask if the addition would overflow.

uint32_t SatAddUnsigned8(uint32_t x, uint32_t y) 
{
  uint32_t signmask = 0x80808080;
  uint32_t t0 = (y ^ x) & signmask;
  uint32_t t1 = (y & x) & signmask;
  x &= ~signmask;
  y &= ~signmask;
  x += y;
  t1 |= t0 & x;
  t1 = (t1 << 1) - (t1 >> 7);
  return (x ^ t0) | t1;
}

You can get the same for 16 bits (or any kind of bit-field) by changing the signmask constant and the shifts at the bottom like this:

uint32_t SatAddUnsigned16(uint32_t x, uint32_t y) 
{
  uint32_t signmask = 0x80008000;
  uint32_t t0 = (y ^ x) & signmask;
  uint32_t t1 = (y & x) & signmask;
  x &= ~signmask;
  y &= ~signmask;
  x += y;
  t1 |= t0 & x;
  t1 = (t1 << 1) - (t1 >> 15);
  return (x ^ t0) | t1;
}

uint32_t SatAddUnsigned32 (uint32_t x, uint32_t y)
{
  uint32_t signmask = 0x80000000;
  uint32_t t0 = (y ^ x) & signmask;
  uint32_t t1 = (y & x) & signmask;
  x &= ~signmask;
  y &= ~signmask;
  x += y;
  t1 |= t0 & x;
  t1 = (t1 << 1) - (t1 >> 31);
  return (x ^ t0) | t1;
}

Above code does the same for 16 and 32 bit values.

If you don't need the feature that the functions add and saturate multiple values in parallel just mask out the bits you need. On ARM you also want to change the signmask constant because ARM can't load all possible 32 bit constants in a single cycle.

Edit: The parallel versions are most likely slower than the straight forward methods, but they are faster if you have to saturate more than one value at a time.

尝蛊 2024-07-13 03:12:54
//function-like macro to add signed vals, 
//then test for overlow and clamp to max if required
#define SATURATE_ADD(a,b,val)  ( {\
if( (a>=0) && (b>=0) )\
{\
    val = a + b;\
    if (val < 0) {val=0x7fffffff;}\
}\
else if( (a<=0) && (b<=0) )\
{\
    val = a + b;\
    if (val > 0) {val=-1*0x7fffffff;}\
}\
else\
{\
    val = a + b;\
}\
})

我做了一个快速测试,似乎有效,但还没有广泛地攻击它! 这适用于 SIGNED 32 位。
op :网页上使用的编辑器不允许我发布宏,即它不理解非缩进语法等!

//function-like macro to add signed vals, 
//then test for overlow and clamp to max if required
#define SATURATE_ADD(a,b,val)  ( {\
if( (a>=0) && (b>=0) )\
{\
    val = a + b;\
    if (val < 0) {val=0x7fffffff;}\
}\
else if( (a<=0) && (b<=0) )\
{\
    val = a + b;\
    if (val > 0) {val=-1*0x7fffffff;}\
}\
else\
{\
    val = a + b;\
}\
})

I did a quick test and seems to work, but not extensively bashed it yet! This works with SIGNED 32 bit.
op : the editor used on the web page does not let me post a macro ie its not understanding non-indented syntax etc!

审判长 2024-07-13 03:12:54

使用 C++,您可以编写 Remo.D 解决方案的更灵活变体:

template<typename T>
T sadd(T first, T second)
{
    static_assert(std::is_integral<T>::value, "sadd is not defined for non-integral types");
    return first > std::numeric_limits<T>::max() - second ? std::numeric_limits<T>::max() : first + second;
}

这可以轻松转换为 C - 使用 limits.h 中定义的限制。 另请注意,固定宽度整数类型可能在您的系统上不可用。

Using C++ you could write a more flexible variant of Remo.D's solution:

template<typename T>
T sadd(T first, T second)
{
    static_assert(std::is_integral<T>::value, "sadd is not defined for non-integral types");
    return first > std::numeric_limits<T>::max() - second ? std::numeric_limits<T>::max() : first + second;
}

This can be easily translated to C - using the limits defined in limits.h. Please also note that the Fixed width integer types might not been available on your system.

人生百味 2024-07-13 03:12:54

饱和算术不是 C 的标准,但它通常通过编译器内部函数实现,因此最有效的方法并不是最干净的。 您必须添加 #ifdef 块才能选择正确的方式。 MSalters 的答案是 x86 架构最快。 对于 ARM,16 位版本需要使用 _arm_qadd16 (Microsoft Visual Studio) 的 __qadd16 函数(ARM 编译器),32 位版本需要使用 __qadd版本。 它们将被自动转换为一条 ARM 指令。

链接:

Saturation arithmetic is not standard for C, but it's often implemented via compiler intrinsics, so the most efficient way will not be the cleanest. You must add #ifdef blocks to select the proper way. MSalters's answer is the fastest for x86 architecture. For ARM you need to use __qadd16 function (ARM compiler) of _arm_qadd16 (Microsoft Visual Studio) for 16 bit version and __qadd for 32-bit version. They'll be automatically translated to one ARM instruction.

Links:

2024-07-13 03:12:54

我将添加上面尚未提到的解决方案。

Intel x86 中存在 ADC 指令。 它表示为 _addcarry_u32() 内部函数。 对于ARM来说应该有类似的内在。

这使我们能够为 Intel x86 实现非常快速的 uint32_t 饱和加法:

在线尝试!< /a>

#include <stdint.h>
#include <immintrin.h>

uint32_t add_sat_u32(uint32_t a, uint32_t b) {
    uint32_t r, carry = _addcarry_u32(0, a, b, &r);
    return r | (-carry);
}

Intel x86 MMX 饱和加法指令可用于实现 uint16_t 变体:

尝试一下在线!

#include <stdint.h>
#include <immintrin.h>

uint16_t add_sat_u16(uint16_t a, uint16_t b) {
    return _mm_cvtsi64_si32(_mm_adds_pu16(
        _mm_cvtsi32_si64(a),
        _mm_cvtsi32_si64(b)
    ));
}

我没有提到ARM解决方案,因为它可以通过其他答案中的其他通用解决方案来实现。

I'll add solutions that were not yet mentioned above.

There exists ADC instruction in Intel x86. It is represented as _addcarry_u32() intrinsic function. For ARM there should be similar intrinsic.

Which allows us to implement very fast uint32_t saturated addition for Intel x86:

Try it online!

#include <stdint.h>
#include <immintrin.h>

uint32_t add_sat_u32(uint32_t a, uint32_t b) {
    uint32_t r, carry = _addcarry_u32(0, a, b, &r);
    return r | (-carry);
}

Intel x86 MMX saturated addition instructions can be used to implement uint16_t variant:

Try it online!

#include <stdint.h>
#include <immintrin.h>

uint16_t add_sat_u16(uint16_t a, uint16_t b) {
    return _mm_cvtsi64_si32(_mm_adds_pu16(
        _mm_cvtsi32_si64(a),
        _mm_cvtsi32_si64(b)
    ));
}

I don't mention ARM solution, as it can be implemented by other generic solutions from other answers.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文