Inline assembly that clobbers the red zone

Posted 2024-11-15 18:12:06

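I'm writing a cryptography program whose core (a wide multiply routine) is written in x86-64 assembly, both for speed and because it makes extensive use of instructions like adc that are not easily accessible from C. I don't want to inline this function, because it is big and is called several times in the inner loop.

Ideally I would also like to define a custom calling convention for this function, because internally it uses all the registers (except rsp), doesn't clobber its arguments, and returns in registers. Right now it is adapted to the C calling convention, but of course this makes it slower (by about 10%).

To avoid this, I can call it with asm("call %Pn" : ... : my_function... : "cc", all the registers); but is there a way to tell GCC that the call instruction messes with the stack? Otherwise GCC will just put all those registers in the red zone, and the top one will get clobbered. I can compile the whole module with -mno-red-zone, but I'd prefer a way to tell GCC that, say, the top 8 bytes of the red zone will be clobbered so that it won't put anything there.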

别低头，皇冠会掉 2024-11-22 18:12:06

From your original question I did not realize gcc limited red-zone use to leaf functions. I don't think that's required by the x86_64 ABI, but it is a reasonable simplifying assumption for a compiler. In that case you only need to make the function calling your assembly routine a non-leaf for purposes of compilation:

int global;

void other(void);   /* defined elsewhere, so GCC cannot see its body */

void was_leaf(void)
{
    if (global) other();
}

GCC can't tell whether global will be true, so it can't optimize away the call to other(), and was_leaf() is no longer a leaf function. I compiled this (with more code that triggered stack usage) and observed that as a leaf it did not move %rsp, and with the modification shown it did.
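To reproduce that observation, a sketch along these lines can be compiled with something like gcc -O2 -S and the two prologues compared. The names leaf_sum, nonleaf_sum and other_calls are hypothetical, and here other() is defined in the same file (marked noinline) only so that the example links on its own. The exact instructions depend on the GCC version and flags, so treat this as an experiment rather than a guarantee.

```c
#include <stddef.h>

int global;                 /* same role as `global` in the answer */
static int other_calls;     /* hypothetical counter so the call has an effect */

/* noinline keeps this as a real call in the generated code. */
__attribute__((noinline)) void other(void) { other_calls++; }

/* Leaf: makes no calls, so GCC is free to keep buf in the red zone
   without adjusting the stack pointer. */
int leaf_sum(int seed)
{
    volatile int buf[8];
    int total = 0;
    for (size_t i = 0; i < 8; i++) buf[i] = seed + (int)i;
    for (size_t i = 0; i < 8; i++) total += buf[i];
    return total;
}

/* Same body plus a call the compiler cannot rule out: not a leaf. */
int nonleaf_sum(int seed)
{
    volatile int buf[8];
    int total = 0;
    if (global) other();
    for (size_t i = 0; i < 8; i++) buf[i] = seed + (int)i;
    for (size_t i = 0; i < 8; i++) total += buf[i];
    return total;
}
```

Both functions compute the same value; the only point of interest is whether the generated code for each one subtracts from %rsp before touching buf.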

I also tried simply allocating more than 128 bytes (just char buf[150]) in a leaf but I was shocked to see it only did a partial subtraction:

    pushq   %rbp
    movq    %rsp, %rbp
    subq    $40, %rsp
    movb    $7, -155(%rbp)

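If I put the leaf-defeating code back in, that becomes subq $160, %rsp. (In the leaf version, the part of the buffer below the 40 reserved bytes lies within the 128-byte red zone, so no further adjustment is needed.)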

╰◇生如夏花灿烂 2024-11-22 18:12:06

The max-performance way might be to write the whole inner loop in asm (including the call instructions, if it's really worth it to unroll but not inline; that is certainly plausible if fully inlining causes too many uop-cache misses elsewhere).

Anyway, have C call an asm function containing your optimized loop.

BTW, clobbering all the registers makes it hard for gcc to make a very good loop, so you might well come out ahead from optimizing the whole loop yourself. (e.g. maybe keep a pointer in a register, and an end-pointer in memory, because cmp mem,reg is still fairly efficient).
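As a rough picture of the loop shape being described, here is the pointer / end-pointer form written in C. It only shows the structure a hand-written asm loop would follow; step() and process_range() are hypothetical names, and step() is a placeholder for the real per-element work.

```c
#include <stddef.h>

/* Hypothetical per-element operation standing in for the real work. */
static long step(long x) { return x + 1; }

void process_range(long *p, size_t count)
{
    long *end = p + count;      /* computed once, outside the loop */
    for (; p != end; p++)       /* one compare against the end pointer per iteration */
        *p = step(*p);
}
```

In an asm version, p would stay in a register and end could live in memory, so that the loop condition is a single cmp with a memory operand.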

Have a look at the code gcc/clang wrap around an asm statement that modifies an array element (on Godbolt):

void testloop(long *p, long count) {
  for (long i = 0 ; i < count ; i++) {
    asm("  #    XXX  asm operand in %0"
    : "+r" (p[i])
    :
    : // "rax",
     "rbx", "rcx", "rdx", "rdi", "rsi", "rbp",
      "r8", "r9", "r10", "r11", "r12","r13","r14","r15"
    );
  }
}

#gcc7.2 -O3 -march=haswell

    push registers and other function-intro stuff
    lea     rcx, [rdi+rsi*8]      ; end-pointer
    mov     rax, rdi
   
    mov     QWORD PTR [rsp-8], rcx    ; store the end-pointer
    mov     QWORD PTR [rsp-16], rdi   ; and the start-pointer

.L6:
    # rax holds the current-position pointer on loop entry
    # also stored in [rsp-16]
    mov     rdx, QWORD PTR [rax]
    mov     rax, rdx                 # looks like a missed optimization vs. mov rax, [rax], because the asm clobbers rdx

         XXX  asm operand in rax

    mov     rbx, QWORD PTR [rsp-16]   # reload the pointer
    mov     QWORD PTR [rbx], rax
    mov     rax, rbx            # another weird missed-optimization (lea rax, [rbx+8])
    add     rax, 8
    mov     QWORD PTR [rsp-16], rax
    cmp     QWORD PTR [rsp-8], rax
    jne     .L6

  # cleanup omitted.

clang counts a separate counter down towards zero. But it uses load / add -1 / store instead of a memory-destination add [mem], -1 / jnz.

You can probably do better than this if you write the whole loop yourself in asm instead of leaving that part of your hot loop to the compiler.

Consider using some XMM registers for integer arithmetic to reduce register pressure on the integer registers, if possible. On Intel CPUs, moving between GP and XMM registers only costs 1 ALU uop with 1c latency. (It's still 1 uop on AMD, but higher latency especially on Bulldozer-family). Doing scalar integer stuff in XMM registers is not much worse, and could be worth it if total uop throughput is your bottleneck, or it saves more spill/reloads than it costs.

But of course XMM is not very viable for loop counters (paddd/pcmpeq/pmovmskb/cmp/jcc or psubd/ptest/jcc are not great compared to sub [mem], 1 / jcc), or for pointers, or for extended-precision arithmetic (manually doing carry-out with a compare and carry-in with another paddq sucks even in 32-bit mode where 64-bit integer regs aren't available). It's usually better to spill/reload to memory instead of XMM registers, if you're not bottlenecked on load/store uops.
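To make the "carry-out with a compare" point concrete, here is a minimal scalar C sketch of a two-limb addition that recovers the carry by comparison. The type u128_limbs and the function add128_compare_carry are hypothetical names. A SIMD version would need the same three steps (add, compare, add again), whereas integer registers can use add followed by adc.

```c
#include <stdint.h>

/* Hypothetical two-limb (128-bit) unsigned integer, low limb first. */
typedef struct { uint64_t lo, hi; } u128_limbs;

u128_limbs add128_compare_carry(u128_limbs a, u128_limbs b)
{
    u128_limbs r;
    r.lo = a.lo + b.lo;               /* wraps modulo 2^64 on overflow */
    uint64_t carry = (r.lo < a.lo);   /* carry-out recovered by a compare */
    r.hi = a.hi + b.hi + carry;       /* carry-in costs an extra add */
    return r;
}
```

Whether a compiler turns this pattern back into add/adc depends on the compiler and version, which is part of why the original routine is written in asm.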

If you also need calls to the function from outside the loop (cleanup or something), write a wrapper or use add $-128, %rsp ; call ; sub $-128, %rsp to preserve the red-zone in those versions. (Note that -128 is encodable as an imm8 but +128 isn't.)
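A minimal sketch of such a wrapper as GNU C inline asm might look like the following. Everything about demo_triple is an assumption made for illustration: it is a stand-in routine defined in a top-level asm block, with a made-up convention (argument and result in rax, rcx declared as clobbered). The #else branch is only there so the file still builds on other targets.

```c
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
/* Hypothetical stand-in for the hand-written routine. The compiler
   never sees its body. Assumed convention: input and output in rax. */
__asm__(
    ".text\n"
    "demo_triple:\n"
    "    lea (%rax,%rax,2), %rax\n"   /* rax = rax * 3 */
    "    ret\n"
);

static inline uint64_t call_demo_triple(uint64_t x)
{
    __asm__ volatile(
        "add $-128, %%rsp\n\t"   /* step below the red zone first */
        "call demo_triple\n\t"   /* return address lands below the red zone */
        "sub $-128, %%rsp"       /* put rsp back */
        : "+a"(x)                /* x goes in and comes back in rax */
        :
        : "rcx", "cc", "memory"  /* conservative clobber list for the sketch */
    );
    return x;
}
#else
/* Fallback for other targets, plain C with the same result. */
static inline uint64_t call_demo_triple(uint64_t x) { return x * 3; }
#endif
```

The operand is register-only on purpose: a memory operand addressed relative to rsp would be off by 128 bytes between the add and the sub.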

Including an actual function call in your C function doesn't necessarily make it safe to assume the red-zone is unused, though. Any spill/reload between (compiler-visible) function calls could use the red-zone, so clobbering all the registers in an asm statement is quite likely to trigger that behaviour.

// a non-leaf function that still uses the red-zone with gcc
void bar(void) {
  //cryptofunc(1);  // gcc/clang don't use the redzone after this (not future-proof)

  volatile int tmp = 1;
  (void)tmp;
  cryptofunc(1);  // but gcc will use the redzone before a tailcall
}

# gcc7.2 -O3 output
    mov     edi, 1
    mov     DWORD PTR [rsp-12], 1
    mov     eax, DWORD PTR [rsp-12]
    jmp     cryptofunc(long)

If you want to depend on compiler-specific behaviour, you could call (with regular C) a non-inline function before the hot loop. With current gcc / clang, that will make them reserve enough stack space since they have to adjust the stack anyway (to align rsp before a call). This is not future-proof at all, but should happen to work.

GNU C has an __attribute__((target("options"))) x86 function attribute, but it's not usable for arbitrary options, and -mno-red-zone is not one of the ones you can toggle on a per-function basis, or with #pragma GCC target ("options") within a compilation unit.

You can use stuff like

__attribute__(( target("sse4.1,arch=core2") ))
void penryn_version(void) {
  ...
}

but not __attribute__(( target("mno-red-zone") )).

There's a #pragma GCC optimize and an optimize function-attribute (neither of which is intended for production code), but #pragma GCC optimize ("-mno-red-zone") doesn't work either. I think the idea is to let some important functions be optimized with -O2 even in debug builds. You can set -f options or -O.
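For reference, a small sketch of the optimize attribute used the way it seems to be intended, raising the optimization level of one function. HOT_O2 and sum_to are hypothetical names. The attribute is guarded because clang does not implement it, and it does not help with the red zone, since -m options are not accepted there.

```c
/* Apply the attribute only where it is supported (GCC proper). */
#if defined(__GNUC__) && !defined(__clang__)
#define HOT_O2 __attribute__((optimize("O2")))
#else
#define HOT_O2
#endif

/* Compiled as if with -O2 by GCC, even when the rest of the file uses -O0. */
HOT_O2 unsigned long sum_to(unsigned long n)
{
    unsigned long total = 0;
    for (unsigned long i = 1; i <= n; i++)
        total += i;
    return total;
}
```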

You could put the function in a file by itself and compile that compilation unit with -mno-red-zone, though. (And hopefully LTO will not break anything...)
