Inline assembly that clobbers the red zone
I'm writing a cryptography program, and the core (a wide multiply routine) is written in x86-64 assembly, both for speed and because it makes extensive use of instructions like `adc` that are not easily accessible from C. I don't want to inline this function, because it's big and it's called several times in the inner loop.

Ideally I would also like to define a custom calling convention for this function, because internally it uses all the registers (except `rsp`), doesn't clobber its arguments, and returns in registers. Right now it's adapted to the C calling convention, but of course this makes it slower (by about 10%).

To avoid this, I can call it with `asm("call %Pn" : ... : my_function... : "cc", all the registers);`, but is there a way to tell GCC that the call instruction messes with the stack? Otherwise GCC will just put all those registers in the red zone, and the top one will get clobbered. I can compile the whole module with `-mno-red-zone`, but I'd prefer a way to tell GCC that, say, the top 8 bytes of the red zone will be clobbered so that it won't put anything there.
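
To make the `adc` point concrete, here is a minimal sketch of a two-limb addition written with GCC extended asm. The helper name `add128` and its limb layout are assumptions for illustration only; they are not taken from the routine described in the question.

```c
#include <stdint.h>

/* Hypothetical helper: a[1]:a[0] += b[1]:b[0], with index 0 the low limb. */
static inline void add128(uint64_t a[2], const uint64_t b[2])
{
    uint64_t lo = a[0], hi = a[1];
    __asm__("addq %[blo], %[lo]\n\t"  /* add low limbs, sets CF on overflow */
            "adcq %[bhi], %[hi]"      /* add high limbs plus the carry flag */
            /* lo is early-clobber: it is written before bhi is read */
            : [lo] "+&r"(lo), [hi] "+r"(hi)
            : [blo] "r"(b[0]), [bhi] "r"(b[1])
            : "cc");
    a[0] = lo;
    a[1] = hi;
}
```

A wide multiply chains many such carries, which is why the whole routine tends to end up in assembly rather than as isolated statements like this one.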
From your original question I did not realize that GCC limits red-zone use to leaf functions. I don't think the x86-64 ABI requires that, but it is a reasonable simplifying assumption for a compiler. In that case you only need to make the function that calls your assembly routine a non-leaf for purposes of compilation, by having it conditionally call another function. GCC can't tell whether `global` will be true, so it can't optimize away the call to `other()`, and `was_leaf()` is not a leaf function anymore.

I compiled this (with more code that triggered stack usage) and observed that as a leaf it did not move `%rsp`, and with the modification it did.

I also tried simply allocating more than 128 bytes (just `char buf[150]`) in a leaf, but I was shocked to see that it only did a partial subtraction, presumably because the rest of the buffer still lives in the red zone. If I put the leaf-defeating code back in, that becomes `subq $160, %rsp`.
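
The code this answer refers to did not survive on this page. The following is a reconstruction from the prose, not the answerer's original: the names `global`, `other()` and `was_leaf()` come from the text, while the function bodies and the `int` parameter are assumptions.

```c
/* Reconstructed sketch; bodies and signature are illustrative assumptions. */
int global;                       /* value is unknown to GCC at compile time */

__attribute__((noinline)) void other(void)
{
    __asm__ volatile("" ::: "memory");  /* opaque body so the call stays */
}

int was_leaf(int x)
{
    if (global)
        other();                  /* GCC cannot prove this never runs */

    volatile char buf[150];       /* force some stack usage */
    buf[0] = (char)x;
    return buf[0] + 1;
}
```

Compiling with `gcc -O2 -S` with and without the `if (global) other();` lines is one way to compare how `%rsp` is adjusted in the two cases.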
The max-performance way might be to write the whole inner loop in asm (including the `call` instructions, if it's really worth it to unroll but not inline; that's certainly plausible if fully inlining causes too many uop-cache misses elsewhere). Either way, have C call an asm function containing your optimized loop.
BTW, clobbering all the registers makes it hard for GCC to make a very good loop, so you might well come out ahead by optimizing the whole loop yourself (e.g. keep a pointer in a register and an end-pointer in memory, because `cmp mem,reg` is still fairly efficient).

Have a look at the code gcc/clang wrap around an `asm` statement that modifies an array element (on Godbolt): clang counts a separate counter down towards zero, but it uses load / `add -1` / store instead of a memory-destination `add [mem], -1` / `jnz`. You can probably do better than this if you write the whole loop yourself in asm instead of leaving that part of your hot loop to the compiler.
Consider using some XMM registers for integer arithmetic to reduce register pressure on the integer registers, if possible. On Intel CPUs, moving between GP and XMM registers only costs 1 ALU uop with 1c latency. (It's still 1 uop on AMD, but higher latency especially on Bulldozer-family). Doing scalar integer stuff in XMM registers is not much worse, and could be worth it if total uop throughput is your bottleneck, or it saves more spill/reloads than it costs.
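
As a small illustration of moving a value between the two register files, the sketch below parks a 64-bit integer in an XMM register and reads it back. The function name and the choice of `xmm15` are arbitrary assumptions; a real routine would do this inside its own assembly rather than through a C helper.

```c
#include <stdint.h>

/* Hypothetical illustration: stash a 64-bit integer in xmm15, then reload it. */
static inline uint64_t roundtrip_via_xmm(uint64_t v)
{
    uint64_t out;
    __asm__("movq %[in], %%xmm15\n\t"  /* GP register -> XMM register */
            "movq %%xmm15, %[out]"     /* XMM register -> GP register */
            : [out] "=r"(out)
            : [in] "r"(v)
            : "xmm15");                /* tell GCC that xmm15 is overwritten */
    return out;
}
```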
But of course XMM is not very viable for loop counters (`paddd` / `pcmpeq` / `pmovmskb` / `cmp` / `jcc` or `psubd` / `ptest` / `jcc` are not great compared to `sub [mem], 1` / `jcc`), or for pointers, or for extended-precision arithmetic (manually doing carry-out with a compare and carry-in with another `paddq` sucks even in 32-bit mode where 64-bit integer regs aren't available). It's usually better to spill/reload to memory instead of XMM registers, if you're not bottlenecked on load/store uops.

If you also need calls to the function from outside the loop (cleanup or something), write a wrapper or use `add $-128, %rsp ; call ; sub $-128, %rsp` to preserve the red zone in those versions. (Note that `-128` is encodeable as an `imm8` but `+128` isn't.)

Including an actual function call in your C function doesn't necessarily make it safe to assume the red zone is unused, though. Any spill/reload between (compiler-visible) function calls could use the red zone, so clobbering all the registers in an `asm` statement is quite likely to trigger that behaviour.

If you want to depend on compiler-specific behaviour, you could call (with regular C) a non-inline function before the hot loop. With current gcc / clang, that will make them reserve enough stack space, since they have to adjust the stack anyway (to align `rsp` before a `call`). This is not future-proof at all, but should happen to work.

GNU C has an `__attribute__((target("options")))` x86 function attribute, but it's not usable for arbitrary options, and `-mno-red-zone` is not one of the ones you can toggle on a per-function basis, or with `#pragma GCC target ("options")` within a compilation unit. Some options work this way, but `__attribute__(( target("mno-red-zone") ))` does not.

There's a `#pragma GCC optimize` and an `optimize` function attribute (both of which are not intended for production code), but `#pragma GCC optimize ("-mno-red-zone")` doesn't work either. I think the idea is to let some important functions be optimized with `-O2` even in debug builds. You can set `-f` options or `-O`.

You could put the function in a file by itself and compile that compilation unit with `-mno-red-zone`, though. (And hopefully LTO will not break anything...)
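
The `add $-128, %rsp ; call ; sub $-128, %rsp` sequence from this answer could be wrapped roughly as follows. This is a sketch under assumptions: `my_asm_routine` is a stand-in that only doubles `rax`, and the wrapper name and operand list are invented for illustration. A real routine that clobbers every register would need all of them listed as outputs or clobbers.

```c
#include <stdint.h>

/* Stand-in for the real assembly routine (hypothetical): rax = rax * 2. */
__asm__(".pushsection .text\n"
        "my_asm_routine:\n\t"
        "addq %rax, %rax\n\t"
        "ret\n"
        ".popsection\n");

static inline uint64_t call_preserving_red_zone(uint64_t x)
{
    __asm__ volatile(
        "addq $-128, %%rsp\n\t"    /* step below the red zone; -128 fits an imm8 */
        "call my_asm_routine\n\t"  /* return address is pushed below the red zone */
        "subq $-128, %%rsp"        /* restore rsp */
        : "+a"(x)                  /* value passed and returned in rax */
        :
        : "cc", "memory");
    return x;
}
```

The extra two instructions cost a little on every call, which is why the answer suggests this form only for calls outside the hot loop.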
Can't you just modify your assembly function to meet the requirements of a signal in the x86-64 ABI by shifting the stack pointer by 128 bytes on entry to your function?

Or, if you are referring to the return pointer itself, put the shift into your call macro (so `sub %rsp; call...`).
Not sure, but looking at the GCC documentation for function attributes, I found the `stdcall` function attribute, which might be of interest.

I'm still wondering what you find problematic with your asm call version. If it's just aesthetics, you could transform it into a macro or an inline function.
What about creating a dummy function that is written in C and does nothing but call the inline assembly?