使用进位标志进行高效 128 位加法

发布于 2024-11-19 16:54:24 字数 783 浏览 10 评论 0原文

我在 C++ 代码的最内部循环中使用 128 位整数计数器。（不相关的背景：实际应用是在规则网格上评估有限差分方程，这涉及到重复递增大整数，甚至 64 位也不够精度，因为小舍入累积足以影响答案。）

我已经表示了整数作为两个 64 位无符号长整型。我现在需要将这些值增加 128 位常量。这并不难，但是您必须手动捕获从低位字到高位字的进位。

我有这样的工作代码：

inline void increment128(unsigned long &hiWord, unsigned long &loWord)
  {
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    loWord += loAdd;
    if (loWord < loAdd) ++hiWord; // test_and_add_carry
    hiWord += hiAdd;
  }

这是紧凑而简单的代码。有用。

不幸的是，这大约占了我运行时间的 20%。杀手线是那个低字测试。如果我删除它，我显然会得到错误的答案，但运行时开销会从 20% 下降到 4%！所以那个携带测试特别贵！

我的问题：即使作为 GCC 的扩展，C++ 是否会公开硬件进位标志？如果实际编译的指令使用添加使用最后进位指令进行 hiWord 加法，则似乎可以在没有上面的 test-and-add-carry 行的情况下完成添加。有没有办法重写测试和添加进位行以使编译器使用内在操作码？

原文

I'm using a 128 bit integer counter in the very inner loops of my C++ code. (Irrelevant background: The actual application is evaluating finite difference equations on a regular grid, which involves repetitively incrementing large integers, and even 64 bits isn't enough precision because small rounding accumulates enough to affect the answers.)

I've represented the integer as two 64 bit unsigned longs. I now need to increment those values by a 128 bit constant. This isn't hard, but you have to manually catch the carry from the low word to the high word.

I have working code something like this:

inline void increment128(unsigned long &hiWord, unsigned long &loWord)
  {
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    loWord += loAdd;
    if (loWord < loAdd) ++hiWord; // test_and_add_carry
    hiWord += hiAdd;
  }

This is tight and simple code. It works.

Unfortunately this is about 20% of my runtime. The killer line is that loWord test. If I remove it, I obviously get the wrong answers but the runtime overhead drops from 20% to 4%! So that carry test is especially expensive!

My question: Does C++ expose the hardware carry flags, even as an extension to GCC?
It seems like the additions could be done without the test-and-add-carry line above if the actual compiled instructions used an add using last carry instruction for the hiWord addition.
Is there a way to rewrite the test-and-add-carry line to get the compiler to use the intrinsic opcode?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

寄人书 2024-11-26 16:54:24

其实如果你仔细写代码的话，gcc会自动使用进位...

目前的GCC可以将 hiWord += (loWord < loAdd); 优化为 add/adc（x86 的 add-with-carry）。 此优化是在 GCC5.3 中引入的。

在 64 位模式下使用单独的 uint64_t 块：https://godbolt.org/z/S2kGRz。
在 32 位模式下使用 uint32_t 块也是如此：https://godbolt。 org/z/9FC9vc

（编者注：当然，困难的部分是编写一个带有进位和执行的正确全加器；这在 C 语言中很难， GCC 不知道如何优化我所见过的任何内容。）

还相关： https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html 可以为您提供来自无符号或有符号溢出检测的进位。

较旧的 GCC，如 GCC4.5，将在 add 的进位上进行分支或 setc，而不是使用 adc，并且仅使用 adc > (add-with-carry) 如果您使用了 __int128，则在 add 的标志结果上。（或 32 位目标上的 uint64_t）。请参阅 gcc 中是否有 128 位整数？ - 仅在 64 位目标上，自 GCC4.1 起受支持。

我用 gcc -O2 -Wall -Werror -S 编译了这段代码：

void increment128_1(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    loWord += loAdd;
    if (loWord < loAdd) ++hiWord; // test_and_add_carry                                                                                                             
    hiWord += hiAdd;
}

void increment128_2(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    loWord += loAdd;
    hiWord += hiAdd;
    hiWord += (loWord < loAdd); // test_and_add_carry                                                                                                               
}

这是increment128_1的程序集：

.cfi_startproc
        movabsq     $-8801131483544218438, %rax
        addq        (%rsi), %rax
        movabsq     $-8801131483544218439, %rdx
        cmpq        %rdx, %rax
        movq        %rax, (%rsi)
        ja  .L5
        movq        (%rdi), %rax
        addq        $1, %rax
.L3:
        movabsq     $6794178679361, %rdx
        addq        %rdx, %rax
        movq        %rax, (%rdi)
        ret

...这是increment128_2的程序集：

        movabsq     $-8801131483544218438, %rax
        addq        %rax, (%rsi)
        movabsq     $6794178679361, %rax
        addq        (%rdi), %rax
        movabsq     $-8801131483544218439, %rdx
        movq        %rax, (%rdi)
        cmpq        %rdx, (%rsi)
        setbe       %dl
        movzbl      %dl, %edx
        leaq        (%rdx,%rax), %rax
        movq        %rax, (%rdi)
        ret

注意第二个版本中缺少条件分支。

[编辑]

另外，引用通常对性能不利，因为 GCC 必须担心别名...通常最好只按值传递内容。考虑：

struct my_uint128_t {
    unsigned long hi;
    unsigned long lo;
};

my_uint128_t increment128_3(my_uint128_t x)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    x.lo += loAdd;
    x.hi += hiAdd + (x.lo < loAdd);
    return x;
}

汇编：

        .cfi_startproc
        movabsq     $-8801131483544218438, %rdx
        movabsq     $-8801131483544218439, %rax
        movabsq     $6794178679362, %rcx
        addq        %rsi, %rdx
        cmpq        %rdx, %rax
        sbbq        %rax, %rax
        addq        %rcx, %rax
        addq        %rdi, %rax
        ret

这实际上是三者中最严格的代码。

...好吧，所以他们实际上都没有自动使用进位:-)。但他们确实避免了条件分支，我敢打赌这是缓慢的部分（因为分支预测逻辑有一半的时间会出错）。

[编辑2]

还有一个，我在进行一些搜索时偶然发现的。您知道 GCC 内置了对 128 位整数的支持吗？

typedef unsigned long my_uint128_t __attribute__ ((mode(TI)));

my_uint128_t increment128_4(my_uint128_t x)
{
    const my_uint128_t hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    return x + (hiAdd << 64) + loAdd;
}

这个程序集几乎是最好的：（

        .cfi_startproc
        movabsq     $-8801131483544218438, %rax
        movabsq     $6794178679361, %rdx
        pushq       %rbx
        .cfi_def_cfa_offset 16
        addq        %rdi, %rax
        adcq        %rsi, %rdx
        popq        %rbx
        .cfi_offset 3, -16
        .cfi_def_cfa_offset 8
        ret

不确定 ebx 的推送/弹出来自哪里，但这仍然不错。）

所有这些都是使用 GCC 4.5.2 的，顺便一提。

Actually gcc will use the carry automatically if you write your code carefully...

Current GCC can optimize hiWord += (loWord < loAdd); into add/adc (x86's add-with-carry). This optimization was introduced in GCC5.3.

With separate uint64_t chunks in 64-bit mode: https://godbolt.org/z/S2kGRz.
And the same thing in 32-bit mode with uint32_t chunks: https://godbolt.org/z/9FC9vc

(editor's note: Of course the hard part is writing a correct full-adder with carry in and carry out; that's hard in C and GCC doesn't know how to optimize any that I've seen.)

Also related: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html can give you carry-out from unsigned, or signed-overflow detection.

Older GCC, like GCC4.5, will branch or setc on the carry-out from an add, instead of using adc, and only used adc (add-with-carry) on the flag-result from an add if you used __int128. (Or uint64_t on a 32-bit target). See Is there a 128 bit integer in gcc? - only on 64-bit targets, supported since GCC4.1.

I compiled this code with gcc -O2 -Wall -Werror -S:

void increment128_1(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    loWord += loAdd;
    if (loWord < loAdd) ++hiWord; // test_and_add_carry                                                                                                             
    hiWord += hiAdd;
}

void increment128_2(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    loWord += loAdd;
    hiWord += hiAdd;
    hiWord += (loWord < loAdd); // test_and_add_carry                                                                                                               
}

This is the assembly for increment128_1:

.cfi_startproc
        movabsq     $-8801131483544218438, %rax
        addq        (%rsi), %rax
        movabsq     $-8801131483544218439, %rdx
        cmpq        %rdx, %rax
        movq        %rax, (%rsi)
        ja  .L5
        movq        (%rdi), %rax
        addq        $1, %rax
.L3:
        movabsq     $6794178679361, %rdx
        addq        %rdx, %rax
        movq        %rax, (%rdi)
        ret

...and this is the assembly for increment128_2:

        movabsq     $-8801131483544218438, %rax
        addq        %rax, (%rsi)
        movabsq     $6794178679361, %rax
        addq        (%rdi), %rax
        movabsq     $-8801131483544218439, %rdx
        movq        %rax, (%rdi)
        cmpq        %rdx, (%rsi)
        setbe       %dl
        movzbl      %dl, %edx
        leaq        (%rdx,%rax), %rax
        movq        %rax, (%rdi)
        ret

Note the lack of conditional branches in the second version.

[edit]

Also, references are often bad for performance, because GCC has to worry about aliasing... It is often better to just pass things by value. Consider:

struct my_uint128_t {
    unsigned long hi;
    unsigned long lo;
};

my_uint128_t increment128_3(my_uint128_t x)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    x.lo += loAdd;
    x.hi += hiAdd + (x.lo < loAdd);
    return x;
}

Assembly:

        .cfi_startproc
        movabsq     $-8801131483544218438, %rdx
        movabsq     $-8801131483544218439, %rax
        movabsq     $6794178679362, %rcx
        addq        %rsi, %rdx
        cmpq        %rdx, %rax
        sbbq        %rax, %rax
        addq        %rcx, %rax
        addq        %rdi, %rax
        ret

This is actually the tightest code of the three.

...OK so none of them actually used the carry automatically :-). But they do avoid the conditional branch, which I bet is the slow part (since the branch prediction logic will get it wrong half the time).

[edit 2]

And one more, which I stumbled across doing a little searching. Did you know GCC has built-in support for 128-bit integers?

typedef unsigned long my_uint128_t __attribute__ ((mode(TI)));

my_uint128_t increment128_4(my_uint128_t x)
{
    const my_uint128_t hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    return x + (hiAdd << 64) + loAdd;
}

The assembly for this one is about as good as it gets:

        .cfi_startproc
        movabsq     $-8801131483544218438, %rax
        movabsq     $6794178679361, %rdx
        pushq       %rbx
        .cfi_def_cfa_offset 16
        addq        %rdi, %rax
        adcq        %rsi, %rdx
        popq        %rbx
        .cfi_offset 3, -16
        .cfi_def_cfa_offset 8
        ret

(Not sure where the push/pop of ebx came from, but this is still not bad.)

All of these are with GCC 4.5.2, by the way.

回复收藏 0 原文

紫﹏色ふ单纯 2024-11-26 16:54:24

当然，最好的答案是使用内置的 __int128_t 支持。

或者，使用内联汇编。我更喜欢使用命名参数形式：

__asm("add %[src_lo], %[dst_lo]\n"
      "adc %[src_hi], %[dst_hi]"
      : [dst_lo] "+&r" (loWord), [dst_hi] "+r" (hiWord)
      : [src_lo] "erm" (loAdd), [src_hi] "erm" (hiAdd)
      : );

loWord 被标记为 early clobber 操作数，因为它是在读取其他一些操作数之前写入的。这可以避免 hiAdd = loWord 的错误代码，因为它将阻止 gcc 使用相同的寄存器来保存两者。不过，它确实会阻止编译器在 loAdd = loWord 情况下使用相同的寄存器，但这是安全的。

正如那个早期问题所指出的那样，内联汇编确实很容易出错（以难以调试的方式，只会在对其内联的代码进行一些更改后才会导致问题）。

假定 x86 和 x86-64 内联 asm 会破坏标志，因此不需要显式的“cc”破坏。

The best answer, of course, is to use the built-in __int128_t support.

Alternatively, use an inline asm. I prefer to use the named-argument form:

__asm("add %[src_lo], %[dst_lo]\n"
      "adc %[src_hi], %[dst_hi]"
      : [dst_lo] "+&r" (loWord), [dst_hi] "+r" (hiWord)
      : [src_lo] "erm" (loAdd), [src_hi] "erm" (hiAdd)
      : );

loWord is flagged as an early clobber operand, because it's written before some of the other operands are read. This avoids wrong code for hiAdd = loWord, because it will stop gcc from using the same register to hold both. It does stop the compiler from using the same register for the loAdd = loWord case, though, where it is safe.

As that early-clobber question points out, inline asm is really easy to get wrong (in hard-to-debug ways which only cause trouble up after some change to the code it's inlined into).

x86 and x86-64 inline asm is assumed to clobber the flags, so an explicit "cc" clobber isn't needed.

回复收藏 0 原文

~没有更多了~