Why does Windows64 use a different calling convention from all other OSes on x86-64?

Posted 2024-10-07 12:10:09

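AMD has an ABI specification that describes the calling convention to use on x86-64. All OSes follow it, except Windows, which has its own x86-64 calling convention. Why?

Does anyone know the technical, historical, or political reasons for this difference, or is it purely a matter of NIH syndrome?

I understand that different OSes may have different needs for higher-level things, but that doesn't explain why, for example, the register parameter passing order on Windows is rcx - rdx - r8 - r9 - rest on stack while everyone else uses rdi - rsi - rdx - rcx - r8 - r9 - rest on stack.

P.S. I am aware of how these calling conventions differ in general, and I know where to find the details if I need them. What I want to know is why.

Edit: for the how, see the wikipedia entry and the links from there.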


唐婉 2024-10-14 12:10:09


Choosing four argument registers on x64 - common to UN*X / Win64

One of the things to keep in mind about x86 is that the register name to "reg number" encoding is not obvious; in terms of instruction encoding (the MOD R/M byte, see http://www.c-jump.com/CIS77/CPU/x86/X77_0060_mod_reg_r_m_byte.htm), register numbers 0...7 are - in that order - ?AX, ?CX, ?DX, ?BX, ?SP, ?BP, ?SI, ?DI.
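To make the numbering concrete, here is a minimal sketch (not taken from any assembler) of how those register numbers land in the reg and r/m fields of a ModRM byte. The helper names `reg_number` and `modrm_reg_direct` are made up for illustration.

```python
# Register numbers 0..7 in ModRM encoding order, as listed above.
LEGACY_REGS = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]


def reg_number(name):
    """Return the 3-bit encoding number for a name such as 'rcx' or 'ecx'."""
    # Only the last two letters identify the register; the prefix gives the width.
    return LEGACY_REGS.index(name.lower()[-2:])


def modrm_reg_direct(reg, rm):
    """Build a ModRM byte for register-direct addressing (mod = 0b11)."""
    return (0b11 << 6) | (reg_number(reg) << 3) | reg_number(rm)


if __name__ == "__main__":
    for name in ("rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"):
        print(name, reg_number(name))
    # 'mov ecx, edx' is opcode 0x89 followed by this ModRM byte.
    print(hex(modrm_reg_direct("edx", "ecx")))
```

The point of the sketch is only that A/C/D are registers 0, 1 and 2, while SI/DI sit at 6 and 7.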

Hence choosing A/C/D (regs 0..2) for the return value and the first two arguments (which is the "classical" 32-bit __fastcall convention) is a logical choice. As far as going to 64-bit is concerned, the "higher" regs are numbered in order, and both Microsoft and UN*X/Linux went for R8 / R9 as the first ones.

Keeping that in mind, Microsoft's choice of RAX (return value) and RCX, RDX, R8, R9 (arg[0..3]) is an understandable selection if you choose four registers for arguments.

I don't know why the AMD64 UN*X ABI chose RDX before RCX.

Choosing six argument registers on x64 - UN*X specific

UN*X, on RISC architectures, has traditionally done argument passing in registers - specifically, for the first six arguments (that's so on PPC, SPARC, MIPS at least). That might be one of the major reasons why the AMD64 (UN*X) ABI designers chose to use six registers on that architecture as well.

So if you want six registers to pass arguments in, and it's logical to choose RCX, RDX, R8 and R9 for four of them, which other two should you pick?

The "higher" regs require an additional instruction prefix byte to select them and therefore have a bigger instruction size footprint, so you wouldn't want to choose any of those if you have options. Of the classical registers, RBP and RSP aren't available because of their implicit meaning, and RBX traditionally has a special use on UN*X (global offset table) which the AMD64 ABI designers seemingly didn't want to needlessly become incompatible with.
Ergo, the only choice was RSI / RDI.
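As a rough illustration of that size footprint, the sketch below models the encoded length of two simple instruction forms. It is a simplified model written for this answer, not an assembler; the function names are hypothetical, and only register-to-register 32-bit mov and 64-bit push are covered.

```python
def needs_rex_bit(reg):
    """True for r8..r15 (with or without a width suffix such as 'r8d')."""
    digits = "".join(ch for ch in reg if ch.isdigit())
    return digits != "" and 8 <= int(digits) <= 15


def mov32_length(dst, src):
    """Bytes for a 32-bit register-to-register mov: opcode + ModRM (+ REX)."""
    return 2 + (1 if needs_rex_bit(dst) or needs_rex_bit(src) else 0)


def push64_length(reg):
    """Bytes for a 64-bit push of a register: one opcode byte (+ REX)."""
    return 1 + (1 if needs_rex_bit(reg) else 0)


if __name__ == "__main__":
    print(mov32_length("eax", "edi"), mov32_length("eax", "r8d"))
    print(push64_length("rbx"), push64_length("r12"))
```

Note that instructions with a 64-bit operand size need a REX prefix anyway (for REX.W), so the extra byte mostly shows up with 32-bit operations and with push/pop.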

So if you have to take RSI / RDI as argument registers, which arguments should they be?

Making them arg[0] and arg[1] has some advantages. See cHao's comment.
?SI and ?DI are the string instruction source / destination operands, and as cHao mentioned, their use as argument registers means that with the AMD64 UN*X calling convention a block-copy routine finds its pointers already in place. A minimal memcpy(), for example, needs little more than mov rcx, rdx; rep movsb; ret (plus setting the return value in RAX), because the destination and source addresses have already been put into RDI and RSI by the caller. There is, particularly in low-level and compiler-generated "glue" code (think, for example, of some C++ heap allocators zero-filling objects on construction, or the kernel zero-filling heap pages on sbrk(), or copy-on-write page faults), an enormous amount of block copy/fill. For code used that frequently, it is useful to save the two or three CPU instructions that would otherwise load such source/target address arguments into the "correct" registers.

So in a way, UN*X and Win64 are only different in that UN*X "prepends" two additional arguments, in purposefully chosen RSI/RDI registers, to the natural choice of four arguments in RCX, RDX, R8 and R9.
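The "prepend" view can be checked with a small sketch. It only models integer/pointer arguments (floating-point and aggregate arguments follow additional rules not covered here), and the names `SYSV_INT_ARGS`, `WIN64_INT_ARGS` and `arg_location` are made up for this example.

```python
# Integer/pointer argument registers, in order, as given in the question.
SYSV_INT_ARGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]
WIN64_INT_ARGS = ["rcx", "rdx", "r8", "r9"]


def arg_location(abi_regs, index):
    """Where the index-th integer argument goes: a register name or 'stack'."""
    return abi_regs[index] if index < len(abi_regs) else "stack"


if __name__ == "__main__":
    for i in range(7):
        print(i, arg_location(SYSV_INT_ARGS, i), arg_location(WIN64_INT_ARGS, i))
    # Dropping RDI/RSI from the UN*X list leaves the same four registers
    # Win64 uses, although RDX and RCX appear in the opposite order.
    print(sorted(SYSV_INT_ARGS[2:]) == sorted(WIN64_INT_ARGS))
```

So the two conventions agree on which four registers to use after RDI/RSI, but not on the position of RCX and RDX.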

Beyond that ...

There are more differences between the UN*X and Windows x64 ABIs than just the mapping of arguments to specific registers. For the overview on Win64, check:

http://msdn.microsoft.com/en-us/library/7kcdt6fy.aspx

Win64 and AMD64 UN*X also differ strikingly in the way stack space is used. On Win64, for example, the caller must allocate stack space for function arguments (the "shadow" or "home" space) even though args 0...3 are passed in registers. On UN*X, on the other hand, a leaf function (i.e. one that doesn't call other functions) is not required to allocate stack space at all if it needs no more than 128 bytes of it, the so-called red zone (yes, you own and can use a certain amount of stack without allocating it ... well, unless you're kernel code, a source of nifty bugs). All of these are particular optimization choices, and most of the rationale for them is explained in the full ABI references that the original poster's wikipedia reference points to.
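A back-of-the-envelope sketch of those two stack rules follows. It is a simplification under stated assumptions: integer/pointer arguments only, 8 bytes per stack slot, and the outgoing area rounded up to 16 bytes on its own (a real compiler aligns the whole frame instead). All function names are hypothetical.

```python
RED_ZONE_BYTES = 128  # UN*X: usable below the stack pointer in leaf functions


def align_up(n, alignment=16):
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def win64_outgoing_bytes(n_int_args):
    """Caller-reserved bytes on Win64: four 8-byte home slots at minimum."""
    return align_up(8 * max(n_int_args, 4))


def sysv_outgoing_bytes(n_int_args):
    """Caller-reserved bytes on UN*X: only arguments beyond the sixth."""
    return align_up(8 * max(n_int_args - 6, 0))


def sysv_leaf_needs_stack_adjust(local_bytes):
    """A UN*X leaf function must move the stack pointer only past the red zone."""
    return local_bytes > RED_ZONE_BYTES


if __name__ == "__main__":
    for n in (0, 2, 5, 7):
        print(n, win64_outgoing_bytes(n), sysv_outgoing_bytes(n))
    print(sysv_leaf_needs_stack_adjust(128), sysv_leaf_needs_stack_adjust(129))
```

The contrast the sketch shows is that a Win64 caller always reserves at least 32 bytes per call site, while a UN*X caller reserves nothing until the seventh integer argument.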


还不是爱你 2024-10-14 12:10:09


I don't know why Windows did what they did. See the end of this answer for a guess. I was curious about how the SysV calling convention was decided on, so I dug into the mailing list archive and found some neat stuff.

It's interesting reading some of those old threads on the AMD64 mailing list, since AMD architects were active on it. For example, choosing register names was one of the hard parts: AMD considered renaming the original 8 registers r0-r7, or calling the new registers UAX etc.

Also, feedback from kernel devs identified things that made the original design of syscall and swapgs unusable. That's how AMD updated the instruction to get this sorted out before releasing any actual chips. It's also interesting that in late 2000, the assumption was that Intel probably wouldn't adopt AMD64.

The SysV (Linux) calling convention, and the decision on how many registers should be callee-preserved vs. caller-save, was made initially in Nov 2000, by Jan Hubicka (a gcc developer). He compiled SPEC2000 and looked at code size and number of instructions. That discussion thread bounces around some of the same ideas as answers and comments on this SO question. In a 2nd thread, he proposed the current sequence as optimal and hopefully final, generating smaller code than some alternatives.

He's using the term "global" to mean call-preserved registers, that have to be push/popped if used.

The choice of rdi, rsi, rdx as the first three args was motivated by:

- minor code-size saving in functions that call memset or other C string function on their args (where gcc inlines a rep string operation?)
- rbx is call-preserved because having two call-preserved regs accessible without REX prefixes (rbx and rbp) is a win. Presumably chosen because they're the only "legacy" registers that aren't implicitly used by any common instruction. (rep string, shift count, and mul/div outputs/inputs touch everything else).
- None of the registers that common instructions force you to use are call-preserved (see prev point), so a function that wants to use a variable-count shift or division might have to move function args somewhere else, but doesn't have to save/restore the caller's value. cmpxchg16b and cpuid need RBX, but are rarely used so not a big factor. (cmpxchg16b wasn't part of original AMD64, but RBX would still have been the obvious choice. cmpxchg8b exists but was obsoleted by qword cmpxchg)

Quoting the proposal:

"We are trying to avoid RCX early in the sequence, since it is register used commonly for special purposes, like EAX, so it has same purpose to be missing in the sequence. Also it can't be used for syscalls and we would like to make syscall sequence to match function call sequence as much as possible."

(Background: syscall / sysret unavoidably destroy rcx (with rip) and r11 (with RFLAGS), so the kernel can't see what was originally in rcx when syscall ran.)

The kernel system-call ABI was chosen to match the function call ABI, except for r10 instead of rcx, so libc wrapper functions like mmap(2) can just mov %rcx, %r10 / mov $0x9, %eax / syscall.
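Here is a small sketch of why such a wrapper is so short: it lists the register moves needed to go from the function-call argument registers to the syscall argument registers. It only generates illustrative AT&T-syntax text; a real libc wrapper also handles the error return and ends with ret, and the function names here are made up.

```python
# Argument registers for a normal call vs. for the syscall instruction.
FUNC_ARGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]
SYSCALL_ARGS = ["rdi", "rsi", "rdx", "r10", "r8", "r9"]


def wrapper_moves(n_args):
    """Moves needed to shuffle n_args function args into syscall registers."""
    pairs = zip(FUNC_ARGS[:n_args], SYSCALL_ARGS[:n_args])
    return [f"mov %{src}, %{dst}" for src, dst in pairs if src != dst]


def wrapper_body(syscall_nr, n_args):
    """Instruction sequence up to and including the syscall instruction."""
    return wrapper_moves(n_args) + [f"mov ${syscall_nr:#x}, %eax", "syscall"]


if __name__ == "__main__":
    # mmap takes six arguments; 9 is the number used in the text above.
    print("\n".join(wrapper_body(9, 6)))
    # A three-argument syscall needs no register shuffling at all.
    print(wrapper_moves(3))
```

Only the fourth argument ever needs a move, which is the payoff of keeping the two sequences aligned.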

Note that the SysV calling convention used by i386 Linux sucks compared to Windows' 32-bit __vectorcall. It passes everything on the stack, and only returns in edx:eax for int64, not for small structs. It's no surprise little effort was made to maintain compatibility with it. Where there was no reason not to, they did things like keeping rbx call-preserved, since they decided that having another call-preserved register among the original 8 (which don't need a REX prefix) was good.
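The two register-choice claims made earlier (two REX-free call-preserved registers, and no implicitly-used register being call-preserved) can be checked mechanically. This is a sketch with hand-written tables: the call-preserved set is the SysV one with rsp left out, and the `IMPLICIT_USE` table is a simplified summary of the instructions named in this answer, not a complete list.

```python
# SysV x86-64 call-preserved general-purpose registers (rsp omitted).
SYSV_CALL_PRESERVED = {"rbx", "rbp", "r12", "r13", "r14", "r15"}

# Registers that common instructions use implicitly (simplified).
IMPLICIT_USE = {
    "rax": "mul/div input and output",
    "rdx": "mul/div input and output",
    "rcx": "variable shift count, rep count",
    "rsi": "string instruction source",
    "rdi": "string instruction destination",
}


def rex_free(regs):
    """Registers from the original eight, i.e. not r8..r15."""
    return sorted(r for r in regs if not r[1:].isdigit())


if __name__ == "__main__":
    print(rex_free(SYSV_CALL_PRESERVED))
    print(SYSV_CALL_PRESERVED.isdisjoint(IMPLICIT_USE))
```

With these tables, rbx and rbp come out as the only call-preserved registers from the original eight, and none of the implicitly-used registers is call-preserved.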

Making the ABI optimal is much more important long-term than any other consideration. I think they did a pretty good job. I'm not totally sure about returning structs packed into registers, instead of different fields in different regs. I guess code that passes them around by value without actually operating on the fields wins this way, but the extra work of unpacking seems silly. They could have had more integer return registers, more than just rdx:rax, so returning a struct with 4 members could return them in rdi, rsi, rdx, rax or something.

They considered passing integers in vector regs, because SSE2 can operate on integers. Fortunately they didn't do that. Integers are used as pointer offsets very often, and a round-trip to stack memory is pretty cheap. Also SSE2 instructions take more code bytes than integer instructions.

I suspect Windows ABI designers might have been aiming to minimize differences between 32 and 64bit for the benefit of people that have to port asm from one to the other, or that can use a couple #ifdefs in some ASM so the same source can more easily build a 32 or 64bit version of a function.

Minimizing changes in the toolchain seems unlikely. An x86-64 compiler needs a separate table of which register is used for what, and what the calling convention is. Having a small overlap with 32bit is unlikely to produce significant savings in toolchain code size / complexity.


飘过的浮云 2024-10-14 12:10:09


Remember that Microsoft was initially "officially noncommittal toward the early AMD64 effort" (from "A History of Modern 64-bit Computing" by Matthew Kerner and Neil Padgett) because they were strong partners with Intel on the IA64 architecture. I think this meant that even if they would otherwise have been open to working with GCC engineers on an ABI to use on both Unix and Windows, they wouldn't have done so, as it would have meant publicly supporting the AMD64 effort when they hadn't yet officially done so (and would probably have upset Intel).

On top of that, back in those days Microsoft had absolutely no leanings toward being friendly with open source projects. Certainly not Linux or GCC.

So why would they have cooperated on an ABI? I'd guess that the ABIs are different simply because they were designed at more or less the same time and in isolation.

Another quote from "A History of Modern 64-bit Computing":

"In parallel with the Microsoft collaboration, AMD also engaged the open source community to prepare for the chip. AMD contracted with both Code Sorcery and SuSE for tool chain work (Red Hat was already engaged by Intel on the IA64 tool chain port). Russell explained that SuSE produced C and FORTRAN compilers, and Code Sorcery produced a Pascal compiler. Weber explained that the company also engaged with the Linux community to prepare a Linux port. This effort was very important: it acted as an incentive for Microsoft to continue to invest in the AMD64 Windows effort, and also ensured that Linux, which was becoming an important OS at the time, would be available once the chips were released.

Weber goes so far as to say that the Linux work was absolutely crucial to AMD64’s success, because it enabled AMD to produce an end-to-end system without the help of any other companies if necessary. This possibility ensured that AMD had a worst-case survival strategy even if other partners backed out, which in turn kept the other partners engaged for fear of being left behind themselves."

This indicates that even AMD didn't feel that cooperation was necessarily the most important thing between MS and Unix, but that having Unix/Linux support was very important. Maybe even trying to convince one or both sides to compromise or cooperate wasn't worth the effort or risk(?) of irritating either of them? Perhaps AMD thought that even suggesting a common ABI might delay or derail the more important objective of simply having software support ready when the chip was ready.

Speculation on my part, but I think the major reason the ABIs are different was the political reason that MS and the Unix/Linux sides just didn't work together on it, and AMD didn't see that as a problem.
