I am following the book "Beginning x64 Assembly Programming", on a 64-bit Linux system. I am using NASM and GCC.
In the chapter about floating-point operations, the book gives the code below for adding 2 floating-point numbers. In the book, and in other online sources, I have read that register RAX specifies the number of XMM registers to be used, according to the calling convention.
The code in the book goes as follows:
extern printf

section .data
num1 dq 9.0
num2 dq 73.0
fmt db "The numbers are %f and %f",10,0
f_sum db "%f + %f = %f",10,0

section .text
global main
main:
    push rbp
    mov rbp, rsp
printn:
    movsd xmm0, [num1]
    movsd xmm1, [num2]
    mov rdi, fmt
    mov rax, 2          ; for printf, RAX specifies the number of XMM registers used
    call printf
sum:
    movsd xmm2, [num1]
    addsd xmm2, [num2]  ; xmm2 holds the sum for the next printf
printsum:
    movsd xmm0, [num1]
    movsd xmm1, [num2]
    mov rdi, f_sum
    mov rax, 3
    call printf
    pop rbp
    mov rax, 0          ; return 0 from main
    ret
That works as expected.

Then, before the last printf call, I tried changing mov rax, 3 to mov rax, 1, reassembled, and ran the program.
I was expecting some different nonsense output, but I was surprised the output was exactly the same. printf outputs the 3 float values correctly:
The numbers are 9.000000 and 73.000000
9.000000 + 73.000000 = 82.000000
I suppose there is some kind of override when printf is expecting the use of several XMM registers: as long as RAX is not 0, it will use consecutive XMM registers. I have searched for an explanation in the calling conventions and the NASM manual, but didn't find one.
What is the reason why this works?
The x86-64 SysV ABI's strict rules allow implementations that only save the exact number of XMM regs specified, but current implementations only check for zero / non-zero because that's efficient, especially for the AL=0 common case.
If you pass a number in AL (footnote 1) lower than the actual number of XMM register args, or a number higher than 8, you'd be violating the ABI, and it's only this implementation detail which stops your code from breaking. (i.e. it "happens to work", but is not guaranteed by any standard or documentation, and isn't portable to some other real implementations, like older GNU/Linux distros that were built with GCC 4.5 or earlier.)
This Q&A shows a current build of glibc printf which just checks for
AL!=0
, vs. an old build of glibc which computes a jump target into a sequence ofmovaps
stores. (That Q&A is about that code breaking whenAL>8
, making the computed jump go somewhere it shouldn't.)Why does eax contain the number of vector parameters? quotes the ABI doc, and shows ICC code-gen which similarly does a computed jump using the same instructions as old GCC.
Glibc's printf implementation is compiled from C source, normally by GCC. When modern GCC compiles a variadic function like printf, it makes asm that only checks for a zero vs. non-zero AL, dumping all 8 arg-passing XMM registers to an array on the stack if non-zero.

GCC 4.5 and earlier actually did use the number in AL to do a computed jump into a sequence of movaps stores, to only actually save as many XMM regs as necessary.

Nate's simple example from comments on Godbolt with GCC 4.5 vs. GCC 11 shows the same difference as the linked answer with disassembly of old/new glibc (built by GCC), unsurprisingly. This function only ever uses va_arg(v, double);, never integer types, so it doesn't dump the incoming RDI...R9 anywhere, unlike printf. And it's a leaf function so it can use the red-zone (128 bytes below RSP).
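As an illustration, here is a minimal NASM-style sketch of the kind of prologue modern GCC emits for a variadic function. The save-area size, offsets, and label are invented for the example, not glibc's actual layout:

; sketch: modern test/jz strategy (offsets and label invented)
    sub    rsp, 216            ; reserve a register save area
    test   al, al              ; AL = number of XMM args (0..8)
    je     .no_xmm             ; zero: skip all vector saves
    movaps [rsp+48],  xmm0     ; non-zero: dump all 8 arg-passing XMM regs,
    movaps [rsp+64],  xmm1     ; however many were actually used
    movaps [rsp+80],  xmm2
    movaps [rsp+96],  xmm3
    movaps [rsp+112], xmm4
    movaps [rsp+128], xmm5
    movaps [rsp+144], xmm6
    movaps [rsp+160], xmm7
.no_xmm:

The branch distinguishes only zero from non-zero, which is exactly why the experiment in the question printed the same output with RAX=1 as with RAX=3.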
GCC presumably changed strategy because the computed jump takes more static code size (I-cache footprint), and a test/jz is easier to predict than an indirect jump. Even more importantly, it's fewer uops executed in the common AL=0 (no-XMM) case (footnote 2). And not many more even for the AL=1 worst case (7 dead movaps stores but no work done computing a branch target).

Related Q&As:
_start (depending on dynamic-linker hooks to get libc startup functions called).

Semi-related while we're talking about calling-convention violations: printf with AL=0, using movaps somewhere other than dumping XMM args to the stack.

Footnote 1: AL, not RAX, is what matters
The x86-64 System V ABI doc specifies that variadic functions must look only at AL for the number of regs; the high 7 bytes of RAX are allowed to hold garbage.
mov eax, 3 is an efficient way to set AL, avoiding possible false dependencies from writing a partial register, although it is larger in machine-code size (5 bytes) than mov al, 3 (2 bytes). clang typically uses mov al, 3.
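For reference, the machine-code encodings of the two options:

mov eax, 3      ; b8 03 00 00 00  (5 bytes, writes the full register)
mov al, 3       ; b0 03           (2 bytes, partial-register write)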
Key points from the ABI doc; see Why does eax contain the number of vector parameters? for more context:
(That last point is obsolete: XMM regs are widely used for memcpy/memset and inlined to zero-init small arrays / structs. So much so that Linux uses "eager" FPU save/restore on context switches, not "lazy" where the first use of an XMM reg faults.)
This ABI guarantee of AL <= 8 is what allows computed-jump implementations to omit bounds-checking. (Similarly, Does the C++ standard allow for an uninitialized bool to crash a program? yes, ABI violations can be assumed not to happen, e.g. by making code that would crash in that case.)
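To make that concrete: a bounds-checked computed jump would need an extra compare-and-branch before using AL as an index, something like this hypothetical check, which compilers can omit because the ABI promises AL <= 8:

    cmp    al, 8
    ja     .abi_violation      ; hypothetical label: never emitted in practice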
Footnote 2: efficiency of the two strategies
Smaller static code-size (I-cache footprint) is always a good thing, and the AL!=0 strategy has that in its favour.
Most importantly, fewer total instructions executed for the AL==0 case.
printf isn't the only variadic function; sscanf is not rare, and it never takes FP args (only pointers). If a compiler can see that a function never uses va_arg with an FP argument, it omits saving entirely, making this point moot; but the scanf/printf functions are normally implemented as wrappers for the vfscanf / vfprintf calls, so the compiler doesn't see that. It sees a va_list being passed to another function, so it has to save everything. (I think it's fairly rare for people to write their own variadic functions, so in a lot of programs the only calls to variadic functions will be to library functions.)

Out-of-order exec can chew through the dead stores just fine for AL<8 but non-zero cases, thanks to wide pipelines and store buffers, getting started on the real work in parallel with those stores happening.
Computing and doing the indirect jump takes 5 total instructions, not counting the lea rsi, -136[rsp] and lea rdx, 39[rsp]. The test/jz strategy also does those or similar, just after the movaps stores, as setup for the va_arg code which has to figure out when it gets to the end of the register-save area and switch to looking at stack args.
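For a sense of what that setup feeds, here is a hedged NASM-style sketch of what va_arg(ap, double) boils down to, based on the SysV va_list layout (gp_offset, fp_offset, overflow_arg_area, reg_save_area); the register choice and labels are invented:

; sketch: va_arg(ap, double) with the va_list pointer in rdi (invented)
    mov    eax, [rdi+4]        ; ap->fp_offset
    cmp    eax, 176            ; past the FP part of the register save area?
    jae    .from_stack
    add    rax, [rdi+16]       ; ap->reg_save_area + fp_offset
    add    dword [rdi+4], 16   ; each FP save slot is 16 bytes
    movsd  xmm0, [rax]
    jmp    .have_arg
.from_stack:
    mov    rax, [rdi+8]        ; ap->overflow_arg_area
    movsd  xmm0, [rax]
    add    qword [rdi+8], 8    ; stack args occupy 8-byte slots
.have_arg: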
I'm also not counting the sub rsp, 48 either; that's necessary either way unless you make the XMM-save-area size variable as well, or only save the low half of each XMM reg so 8x 8 B = 64 bytes would fit in the red-zone. In theory variadic functions can take a 16-byte __m128d arg in an XMM reg, so GCC uses movaps instead of movlps. (I'm not sure if glibc printf has any conversions that would take one.) And in non-leaf functions like actual printf, you'd always need to reserve more space instead of using the red-zone. (This is one reason for the lea rdx, 39[rsp] in the computed-jump version: every movaps needs to be exactly 4 bytes, so the compiler's recipe for generating that code has to make sure its offsets are in the [-128,+127] range of a [reg+disp8] addressing mode, and not 0, unless GCC was going to use special asm syntax to force a longer instruction there.)

Almost all x86-64 CPUs run 16-byte stores as a single micro-fused uop (only crusty old AMD K8 and Bobcat split them into 8-byte halves; see https://agner.org/optimize/), and we'd usually be touching stack space below that 128-byte area anyway. (Also, the computed-jump strategy stores to the bottom itself, so it doesn't avoid touching that cache line.)
So for a function with one XMM arg, the computed-jump version takes 6 total single-uop instructions (5 integer ALU/jump, one movaps) to get the XMM arg saved.
The test/jz version takes 9 total uops (10 instructions but test/jz macro-fuse in 64-bit mode on Intel since Nehalem, AMD since Bulldozer IIRC). 1 macro-fused test-and-branch, and 8 movaps stores.
And that's the best case for the computed-jump version: with more XMM args, it still runs 5 instructions to compute the jump target, but has to run more movaps instructions. The test/jz version is always 9 uops. So the break-even point for dynamic uop count (actually executed, vs. sitting there in memory taking up I-cache footprint) is 4 XMM args, which is probably rare, but the test/jz version has other advantages. Especially in the AL == 0 case, where it's 5 vs. 1.
The test/jz branch always goes to the same place for any number of XMM args except zero, making it easier to predict than an indirect branch that's different for printf("%f %f\n", ...) vs. "%f\n".
3 of the 5 instructions (not including the jmp) in the computed-jump version form a dependency chain from the incoming AL, making it take that many more cycles before a misprediction can be detected (even though the chain probably started with a mov eax, 1 right before the call). But the "extra" instructions in the dump-everything strategy are just dead stores of some of XMM1..7 that never get reloaded and aren't part of any dependency chain. As long as the store buffer and ROB/RS can absorb them, out-of-order exec can work on them at its leisure.

(To be fair, they will tie up the store-data and store-address execution units for a while, meaning that later stores won't be ready for store-forwarding as soon either. And on CPUs where store-address uops run on the same execution units as loads, later loads can be delayed by those store uops hogging those execution units. Fortunately, modern CPUs have at least 2 load execution units, and Intel from Haswell to Skylake can run store-address uops on any of 3 ports, with simple addressing modes like this. Ice Lake has 2 load / 2 store ports with no overlap.)
The computed-jump version saves XMM0 last, and XMM0 is likely to be the first arg reloaded. (Most variadic functions go through their args in order.) If there are multiple XMM args, the computed-jump way won't be ready to store-forward from that store until a couple of cycles later. But for cases with AL=1 that's the only XMM store, with no other work tying up load / store-address execution units, and small numbers of args are probably more common.
Most of these reasons are really minor compared to the advantage of smaller code footprint, and fewer instructions executed for the AL==0 case. It's just fun (for some of us) to think through the up/down sides of the modern simple way, to show that even in its worst case, it's not a problem.