Is fastcall really faster?
Is the fastcall calling convention really faster than other calling conventions, such as cdecl?
Are there any benchmarks out there that show how performance is affected by calling convention?
Comments (4)
It depends on the platform. On a Xenon PowerPC, for example, the difference can be an order of magnitude, because passing data on the stack triggers load-hit-store stalls. I empirically timed the overhead of a cdecl function at about 45 cycles, compared to about 4 for a fastcall.
On an out-of-order x86 (Intel and AMD), the impact may be much smaller, because the registers are all shadowed and renamed anyway.
The answer really is that you need to benchmark it yourself on the particular platform you care about.
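As a rough starting point for that kind of measurement, here is a minimal sketch in C. Everything in it is an assumption for illustration rather than something from the answer: the `CC_CDECL`, `CC_FASTCALL` and `NOINLINE` macros and the `add_*` and `time_*` functions are hypothetical names. The convention keywords are only applied on 32-bit x86, where they mean something, and expand to nothing elsewhere. `clock()` is coarse, so a serious benchmark would use a cycle counter and check the generated assembly.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper macros: calling-convention keywords are only
   meaningful on 32-bit x86, so they expand to nothing on other targets. */
#if defined(_MSC_VER) && defined(_M_IX86)
#  define CC_CDECL    __cdecl
#  define CC_FASTCALL __fastcall
#elif defined(__GNUC__) && defined(__i386__)
#  define CC_CDECL    __attribute__((cdecl))
#  define CC_FASTCALL __attribute__((fastcall))
#else
#  define CC_CDECL
#  define CC_FASTCALL
#endif

/* Ask the compiler not to inline, otherwise there is no call to measure. */
#if defined(_MSC_VER)
#  define NOINLINE __declspec(noinline)
#elif defined(__GNUC__)
#  define NOINLINE __attribute__((noinline))
#else
#  define NOINLINE
#endif

/* Two identical bodies that differ only in calling convention. */
NOINLINE unsigned CC_CDECL add_cdecl(unsigned a, unsigned b) { return a + b; }
NOINLINE unsigned CC_FASTCALL add_fastcall(unsigned a, unsigned b) { return a + b; }

/* Time `iterations` dependent calls to add_cdecl; returns CPU seconds and
   stores the accumulated value in *result so the loop has a visible effect. */
double time_cdecl(unsigned long iterations, unsigned *result)
{
    unsigned acc = 0;
    unsigned long i;
    clock_t start, stop;

    assert(result != 0);
    start = clock();
    for (i = 0; i < iterations; ++i)
        acc = add_cdecl(acc, (unsigned)i);
    stop = clock();

    *result = acc;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Same loop, calling the fastcall variant instead. */
double time_fastcall(unsigned long iterations, unsigned *result)
{
    unsigned acc = 0;
    unsigned long i;
    clock_t start, stop;

    assert(result != 0);
    start = clock();
    for (i = 0; i < iterations; ++i)
        acc = add_fastcall(acc, (unsigned)i);
    stop = clock();

    *result = acc;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling `time_cdecl` and `time_fastcall` with the same large iteration count, several times each, and comparing the returned times gives a first impression. On a 64-bit build both functions use the same convention, so any difference there is noise.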
I believe that Microsoft's implementation of fastcall on x86 involves passing the first two parameters in registers (ECX and EDX) instead of on the stack. On x64 the keyword is accepted but ignored, since the single x64 calling convention already passes the first four arguments in registers.
Since it typically saves at least four memory accesses, yes, it is generally faster. However, if the function involved is register-starved and is thus likely to write the parameters to locals on the stack anyway, there's not likely to be a significant speedup.
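To make the register-versus-stack difference concrete, here is a small sketch in C. The `CC_CDECL` and `CC_FASTCALL` macros and the `sum3_*` functions are hypothetical names made up for illustration. The comments describe where the arguments travel on 32-bit x86, and on any other target the macros expand to nothing and both functions use the platform's normal convention.

```c
#include <assert.h>

/* Hypothetical helper macros: apply the keywords only on 32-bit x86. */
#if defined(_MSC_VER) && defined(_M_IX86)
#  define CC_CDECL    __cdecl
#  define CC_FASTCALL __fastcall
#elif defined(__GNUC__) && defined(__i386__)
#  define CC_CDECL    __attribute__((cdecl))
#  define CC_FASTCALL __attribute__((fastcall))
#else
#  define CC_CDECL
#  define CC_FASTCALL
#endif

/* cdecl on x86: c, b, a are pushed right to left, and the caller
   removes them from the stack after the call returns. */
int CC_CDECL sum3_cdecl(int a, int b, int c)
{
    return a + b + c;
}

/* fastcall on x86: a arrives in ECX and b in EDX; c is still pushed
   on the stack, and the callee removes it before returning. */
int CC_FASTCALL sum3_fastcall(int a, int b, int c)
{
    return a + b + c;
}

/* The convention changes how arguments travel, never the result. */
void check_conventions_agree(void)
{
    assert(sum3_cdecl(1, 2, 3) == sum3_fastcall(1, 2, 3));
}
```

Compiling this for 32-bit x86 and reading the disassembly of a call to each function shows the pushes disappear for the first two fastcall arguments.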
Calling convention (at least on x86) doesn't really make much of a difference in speed. In Windows, _stdcall was made the default for the Win32 API because it produces tangible results for nontrivial programs: it usually results in smaller code than _cdecl, since the callee cleans up the stack once instead of every call site doing so. _fastcall is not the default because the difference it makes is far less tangible. What you gain by passing arguments in registers you lose in less efficient function bodies (as previously mentioned by Anon.). You don't gain anything by passing in registers if the called function immediately needs to spill everything out to memory for its own calculations.
However, we can spout theoretical ideas all day long. Benchmark your code for the right answer. _fastcall will be faster in some cases and slower in others.
On modern x86, no. Between the L1 cache and inlining, there's no place for fastcall.
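The inlining half of that argument can be illustrated with a short sketch in C. The names `square` and `sum_of_squares` are hypothetical, and `static inline` is only a hint, so whether the call actually disappears depends on the compiler and optimization level. When it does, no arguments are passed at all and the calling convention stops mattering for that call.

```c
#include <assert.h>

/* A small helper that an optimizing compiler will typically inline.
   Once inlined there is no call instruction and no argument passing. */
static inline unsigned square(unsigned x)
{
    return x * x;
}

/* Sums i*i for i in [0, n). With square() inlined, the loop body is
   just a multiply and an add, whatever convention square() would use. */
unsigned sum_of_squares(unsigned n)
{
    unsigned total = 0;
    unsigned i;

    /* Keep the illustrative sum well inside a 32-bit unsigned range. */
    assert(n <= 1000u);

    for (i = 0; i < n; ++i)
        total += square(i);
    return total;
}
```

Comparing the disassembly of `sum_of_squares` at `-O0` and `-O2` (or `/Od` and `/O2`) is a quick way to see whether the call survived.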