Why does printf still work with RAX lower than the number of FP args in XMM registers?

Posted on 2025-01-23 13:49:58


I am following the book "Beginning x64 Assembly Programming" on a 64-bit Linux system. I am using NASM and GCC.
In the chapter about floating-point operations the book gives the code below for adding 2 floating-point numbers. In the book, and in other online sources, I have read that register RAX specifies the number of XMM registers to be used, according to the calling convention.
The code in the book goes as follows:

extern printf
section .data
num1        dq  9.0
num2        dq  73.0
fmt     db  "The numbers are %f and %f",10,0
f_sum       db  "%f + %f = %f",10,0

section .text
global main
main:
    push rbp
    mov rbp, rsp
printn:
    movsd xmm0, [num1]
    movsd xmm1, [num2]
    mov rdi, fmt
    mov rax, 2      ;for printf rax specifies amount of xmm registers
    call printf

sum:
    movsd xmm2, [num1]
    addsd xmm2, [num2]
printsum:
    movsd xmm0, [num1]
    movsd xmm1, [num2]
    mov rdi, f_sum
    mov rax, 3
    call printf

That works as expected.
Then, before the last printf call, I tried changing

mov rax, 3

to

mov rax, 1

and then reassembled and ran the program.

I was expecting some different nonsense output, but I was surprised the output was exactly the same. printf outputs the 3 float values correctly:

The numbers are 9.000000 and 73.000000
9.000000 + 73.000000 = 82.000000

I suppose there is some kind of override when printf is expecting the use of several XMM registers, and as long as RAX is not 0, it will use consecutive XMM registers. I have searched for an explanation in calling conventions and NASM manual, but didn't find one.

What is the reason why this works?


Answer by 薄暮涼年, 2025-01-30 13:49:58


The x86-64 SysV ABI's strict rules allow implementations that only save the exact number of XMM regs specified, but current implementations only check for zero / non-zero because that's efficient, especially for the AL=0 common case.

If you pass a number in AL¹ lower than the actual number of XMM register args, or a number higher than 8, you'd be violating the ABI, and it's only this implementation detail which stops your code from breaking. (i.e. it "happens to work", but is not guaranteed by any standard or documentation, and isn't portable to some other real implementations, like older GNU/Linux distros that were built with GCC4.5 or earlier.)

This Q&A shows a current build of glibc printf which just checks for AL!=0, vs. an old build of glibc which computes a jump target into a sequence of movaps stores. (That Q&A is about that code breaking when AL>8, making the computed jump go somewhere it shouldn't.)

Why does eax contain the number of vector parameters? quotes the ABI doc, and shows ICC code-gen which similarly does a computed jump using the same instructions as old GCC.


Glibc's printf implementation is compiled from C source, normally by GCC. When modern GCC compiles a variadic function like printf, it makes asm that only checks for a zero vs. non-zero AL, dumping all 8 arg-passing XMM registers to an array on the stack if non-zero.

GCC4.5 and earlier actually did use the number in AL to do a computed jump into a sequence of movaps stores, to only actually save as many XMM regs as necessary.

Nate's simple example from comments on Godbolt with GCC4.5 vs. GCC11 shows the same difference as the linked answer with disassembly of old/new glibc (built by GCC), unsurprisingly. This function only ever uses va_arg(v, double);, never integer types, so it doesn't dump the incoming RDI...R9 anywhere, unlike printf. And it's a leaf function so it can use the red-zone (128 bytes below RSP).

# GCC4.5.3 -O3 -fPIC    to compile like glibc would
add_them:
        movzx   eax, al
        sub     rsp, 48                  # reserve stack space, needed either way
        lea     rdx, 0[0+rax*4]          # each movaps is 4 bytes long
        lea     rax, .L2[rip]            # code pointer to after the last movaps
        lea     rsi, -136[rsp]             # used later by va_arg.  test/jz version does the same, but after the movaps stores
        sub     rax, rdx
        lea     rdx, 39[rsp]               # used later by va_arg, test/jz version also does an LEA like this
        jmp     rax                      # AL=0 case jumps to L2
        movaps  XMMWORD PTR -15[rdx], xmm7     # using RDX as a base makes each movaps 4 bytes long, vs. 5 with RSP
        movaps  XMMWORD PTR -31[rdx], xmm6
        movaps  XMMWORD PTR -47[rdx], xmm5
        movaps  XMMWORD PTR -63[rdx], xmm4
        movaps  XMMWORD PTR -79[rdx], xmm3
        movaps  XMMWORD PTR -95[rdx], xmm2
        movaps  XMMWORD PTR -111[rdx], xmm1
        movaps  XMMWORD PTR -127[rdx], xmm0   # xmm0 last, will be ready for store-forwarding last
.L2:
        lea     rax, 56[rsp]       # first stack arg (if any), I think
     ## rest of the function

vs.

# GCC11.2 -O3 -fPIC
add_them:
        sub     rsp, 48
        test    al, al
        je      .L15                          # only one test&branch macro-fused uop
        movaps  XMMWORD PTR -88[rsp], xmm0    # xmm0 first
        movaps  XMMWORD PTR -72[rsp], xmm1
        movaps  XMMWORD PTR -56[rsp], xmm2
        movaps  XMMWORD PTR -40[rsp], xmm3
        movaps  XMMWORD PTR -24[rsp], xmm4
        movaps  XMMWORD PTR -8[rsp], xmm5
        movaps  XMMWORD PTR 8[rsp], xmm6
        movaps  XMMWORD PTR 24[rsp], xmm7
.L15:
        lea     rax, 56[rsp]        # first stack arg (if any), I think
        lea     rsi, -136[rsp]      # used by va_arg.  done after the movaps stores instead of before.
...
        lea     rdx, 56[rsp]        # used by va_arg.  With a different offset than older GCC, but used somewhat similarly.  Redundant with the LEA into RAX; silly compiler.

GCC presumably changed strategy because the computed jump takes more static code size (I-cache footprint), and a test/jz is easier to predict than an indirect jump. Even more importantly, it's fewer uops executed in the common AL=0 (no-XMM) case². And not many more even for the AL=1 worst case (7 dead movaps stores but no work done computing a branch target).



Footnote 1: AL, not RAX, is what matters

The x86-64 System V ABI doc specifies that variadic functions must look only at AL for the number of regs; the high 7 bytes of RAX are allowed to hold garbage. mov eax, 3 is an efficient way to set AL, avoiding possible false dependencies from writing a partial register, although it is larger in machine-code size (5 bytes) than mov al,3 (2 bytes). clang typically uses mov al, 3.

Key points from the ABI doc, see Why does eax contain the number of vector parameters? for more context:

The prologue should use %al to avoid unnecessarily saving XMM registers. This is especially important for integer only programs to prevent the initialization of the XMM unit.

(That last point is obsolete: XMM regs are widely used for memcpy/memset and inlined to zero-init small arrays / structs. So much so that Linux uses "eager" FPU save/restore on context switches, not "lazy" where the first use of an XMM reg faults.)

The contents of %al do not need to match exactly the number of registers, but must be an upper bound on the number of vector registers used and is in the range 0–8 inclusive.

This ABI guarantee of AL <= 8 is what allows computed-jump implementations to omit bounds-checking. (Similarly, Does the C++ standard allow for an uninitialized bool to crash a program? yes, ABI violations can be assumed not to happen, e.g. by making code that would crash in that case.)


Footnote 2: efficiency of the two strategies

Smaller static code-size (I-cache footprint) is always a good thing, and the AL!=0 strategy has that in its favour.

Most importantly, fewer total instructions executed for the AL==0 case. printf isn't the only variadic function; sscanf is not rare, and it never takes FP args (only pointers). If a compiler can see that a function never uses va_arg with an FP argument, it omits saving entirely, making this point moot, but the scanf/printf functions are normally implemented as wrappers for the vfscanf / vfprintf calls, so the compiler doesn't see that, it sees a va_list being passed to another function so it has to save everything. (I think it's fairly rare for people to write their own variadic functions, so in a lot of programs the only calls to variadic functions will be to library functions.)

Out-of-order exec can chew through the dead stores just fine for AL<8 but non-zero cases, thanks to wide pipelines and store buffers, getting started on the real work in parallel with those stores happening.

Computing and doing the indirect jump takes 5 total instructions, not counting the lea rsi, -136[rsp] and lea rdx, 39[rsp]. The test/jz strategy also does those or similar, just after the movaps stores, as setup for the va_arg code which has to figure out when it gets to the end of the register-save area and switch to looking at stack args.

I'm not counting the sub rsp, 48 either; that's necessary either way, unless you make the XMM-save-area size variable as well, or only save the low half of each XMM reg so 8 x 8 B = 64 bytes would fit in the red-zone. In theory variadic functions can take a 16-byte __m128d arg in an XMM reg, so GCC uses movaps instead of movlps. (I'm not sure if glibc printf has any conversions that would take one.) And in non-leaf functions like the actual printf, you'd always need to reserve more space instead of using the red-zone. (This is one reason for the lea rdx, 39[rsp] in the computed-jump version: every movaps needs to be exactly 4 bytes, so the compiler's recipe for generating that code has to make sure the offsets are in the [-128,+127] range of a [reg+disp8] addressing mode, and not 0, unless GCC was going to use special asm syntax to force a longer instruction there.)

Almost all x86-64 CPUs run 16-byte stores as a single micro-fused uop (only crusty old AMD K8 and Bobcat split them into 8-byte halves; see https://agner.org/optimize/), and we'd usually be touching stack space below that 128-byte area anyway. (Also, the computed-jump strategy stores to the bottom itself, so it doesn't avoid touching that cache line.)

So for a function with one XMM arg, the computed-jump version takes 6 total single-uop instructions (5 integer ALU/jump, one movaps) to get the XMM arg saved.

The test/jz version takes 9 total uops (10 instructions but test/jz macro-fuse in 64-bit mode on Intel since Nehalem, AMD since Bulldozer IIRC). 1 macro-fused test-and-branch, and 8 movaps stores.

And that's the best case for the computed-jump version: with more xmm args, it still runs 5 instructions to compute the jump target, but has to run more movaps instructions. The test/jz version is always 9 uops. So the break-even point for dynamic uop count (actually executed, vs. sitting there in memory taking up I-cache footprint) is 4 XMM args which is probably rare, but it has other advantages. Especially in the AL == 0 case where it's 5 vs. 1.

The test/jz branch always goes to the same place for any number of XMM args except zero, making it easier to predict than an indirect branch that's different for printf("%f %f\n", ...) vs "%f\n".

3 of the 5 instructions (not including the jmp) in the computed-jump version form a dependency chain from the incoming AL, making it take that many more cycles before a misprediction can be detected (even though the chain probably started with a mov eax, 1 right before the call). But the "extra" instructions in the dump-everything strategy are just dead stores of some of XMM1..7 that never get reloaded and aren't part of any dependency chain. As long as the store buffer and ROB/RS can absorb them, out-of-order exec can work on them at its leisure.

(To be fair, they will tie up the store-data and store-address execution units for a while, meaning that later stores won't be ready for store-forwarding as soon either. And on CPUs where store-address uops run on the same execution units as loads, later loads can be delayed by those store uops hogging those execution units. Fortunately, modern CPUs have at least 2 load execution units, and Intel from Haswell to Skylake can run store-address uops on any of 3 ports, with simple addressing modes like this. Ice Lake has 2 load / 2 store ports with no overlap.)

The computed jump version saves XMM0 last, which is likely to be the first arg reloaded. (Most variadic functions go through their args in order.) If there are multiple XMM args, the computed-jump way won't be ready to store-forward from that store until a couple of cycles later. But for cases with AL=1 that's the only XMM store, there's no other work tying up load/store-address execution units, and small numbers of args are probably more common.

Most of these reasons are really minor compared to the advantage of smaller code footprint, and fewer instructions executed for the AL==0 case. It's just fun (for some of us) to think through the up/down sides of the modern simple way, to show that even in its worst case, it's not a problem.
