"enter" vs "push ebp; mov ebp, esp; sub esp, imm" and "leave" vs "mov esp, ebp; pop ebp"
What is the difference between the `enter` instruction and the

    push ebp
    mov ebp, esp
    sub esp, imm

sequence? Is there a performance difference? If so, which is faster, and why do compilers always use the latter?

The same question applies to `leave` versus

    mov esp, ebp
    pop ebp
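To make the comparison concrete, here is a minimal sketch in C of the register and stack effects of both forms. It is a toy model, not real machine code: the `Cpu` struct, the helper names, and the starting register values are all made up for illustration, and it assumes a 32-bit stack and a nesting level of 0 for `enter`. It only shows that the two forms leave the same architectural state; it says nothing about speed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy machine: two registers and a small stack.
   Addresses are byte offsets into mem, kept 4-byte aligned. */
typedef struct {
    uint32_t esp, ebp;
    uint32_t mem[32];
} Cpu;

static void push32(Cpu *c, uint32_t v) { c->esp -= 4; c->mem[c->esp / 4] = v; }
static uint32_t pop32(Cpu *c) { uint32_t v = c->mem[c->esp / 4]; c->esp += 4; return v; }

/* enter imm, 0 : a single instruction */
static void enter_level0(Cpu *c, uint16_t imm) {
    push32(c, c->ebp);
    uint32_t frame_temp = c->esp;
    c->ebp = frame_temp;
    c->esp -= imm;
}

/* The three-instruction prologue */
static void prologue_three(Cpu *c, uint32_t imm) {
    push32(c, c->ebp);   /* push ebp     */
    c->ebp = c->esp;     /* mov ebp, esp */
    c->esp -= imm;       /* sub esp, imm */
}

/* leave : a single instruction */
static void leave_insn(Cpu *c) { c->esp = c->ebp; c->ebp = pop32(c); }

/* The two-instruction epilogue */
static void epilogue_two(Cpu *c) {
    c->esp = c->ebp;     /* mov esp, ebp */
    c->ebp = pop32(c);   /* pop ebp      */
}

static int same_state(const Cpu *a, const Cpu *b) {
    return a->esp == b->esp && a->ebp == b->ebp &&
           memcmp(a->mem, b->mem, sizeof a->mem) == 0;
}

/* Returns 1 if both prologue forms and both epilogue forms agree. */
int prologue_epilogue_agree(void) {
    Cpu a = { .esp = 96, .ebp = 120 };  /* arbitrary start; mem is zeroed */
    Cpu b = a;
    enter_level0(&a, 16);
    prologue_three(&b, 16);
    if (!same_state(&a, &b)) return 0;
    leave_insn(&a);
    epilogue_two(&b);
    /* Both epilogues must restore the caller's esp and ebp. */
    return same_state(&a, &b) && a.esp == 96 && a.ebp == 120;
}
```

Calling `prologue_epilogue_agree()` from a small `main` is enough to check the model. Since the end state is identical, any difference between the forms comes down to decode cost, latency and code size.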
There is a performance difference, especially for `enter`. On modern processors it decodes to some 10 to 20 µops, while the three-instruction sequence is about 4 to 6, depending on the architecture. For details, consult Agner Fog's instruction tables.
Additionally, the `enter` instruction usually has quite a high latency, for example 8 clocks on a Core 2, compared to the 3-clock dependency chain of the three-instruction sequence.
Furthermore, the compiler may spread the three-instruction sequence out for scheduling purposes, depending on the surrounding code of course, to allow more parallel execution of instructions.
When designing the 80286, Intel's CPU designers decided to add two instructions to help maintain displays.
Here is the microcode inside the CPU:
The alternative to ENTER would be:
; enter n, 0 ;14 cycles on the 486
; enter n, 1 ;17 cycles on the 486
; enter n, 3 ;23 cycles on the 486
Etc. The long way might increase your file size, but it is way quicker.
One last note: programmers don't really use displays anymore, since they were a very slow workaround, which makes ENTER pretty useless now.
Source: https://courses.engr.illinois.edu/ece390/books/artofasm/CH12/CH12-3.html
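As a rough illustration of what the second operand of `enter` does, here is a toy C model of the instruction with a nonzero nesting level. It follows the operation described in the Intel manual for a 32-bit stack, as best I understand it; it is not the microcode listing referred to above. The `Cpu` struct, the function names, and the register values in the demo are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy machine; addresses are byte offsets into mem. */
typedef struct {
    uint32_t esp, ebp;
    uint32_t mem[64];
} Cpu;

static void push32(Cpu *c, uint32_t v) { c->esp -= 4; c->mem[c->esp / 4] = v; }

/* Model of: enter alloc_size, imm8 (32-bit operand and stack size assumed) */
void enter_model(Cpu *c, uint16_t alloc_size, uint8_t imm8) {
    unsigned level = imm8 % 32;          /* nesting level is taken mod 32 */
    push32(c, c->ebp);                   /* save the caller's frame pointer */
    uint32_t frame_temp = c->esp;
    if (level > 0) {
        /* Copy level-1 display entries from the caller's frame... */
        for (unsigned i = 1; i < level; i++) {
            c->ebp -= 4;
            push32(c, c->mem[c->ebp / 4]);
        }
        /* ...then append a pointer to the new frame itself. */
        push32(c, frame_temp);
    }
    c->ebp = frame_temp;
    c->esp -= alloc_size;                /* reserve space for locals */
}

/* Stack stores performed by the model: the saved ebp plus `level` entries. */
unsigned enter_stack_writes(uint8_t imm8) { return 1u + imm8 % 32u; }

/* Demo: the caller's frame at 160 holds two display entries below ebp. */
Cpu demo_enter_level3(void) {
    Cpu c = { .esp = 120, .ebp = 160 };
    c.mem[156 / 4] = 0x1111;             /* caller's display entry 1 */
    c.mem[152 / 4] = 0x2222;             /* caller's display entry 2 */
    enter_model(&c, 8, 3);
    return c;
}
```

In this model the number of stack stores grows with the nesting level (1, 2 and 4 stores for levels 0, 1 and 3). That is at least consistent with the rising 486 cycle counts quoted above, though the model makes no claim about actual timings.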
`enter` is unusably slow on all CPUs; nobody uses it except maybe for code-size optimization at the expense of speed. (That is, if a frame pointer is needed at all, or is desired to allow more compact addressing modes for addressing stack space.)

`leave` is fast enough to be worth using, and GCC does use it (if ESP / RSP isn't already pointing at a saved EBP/RBP; otherwise it just uses `pop ebp`).

`leave` is only 3 uops on modern Intel CPUs (and 2 on some AMD). (https://agner.org/optimize/, https://uops.info/).

`mov` / `pop` is only 2 uops total (on modern x86 where a "stack engine" tracks updates to ESP/RSP). So `leave` is just one more uop than doing things separately. I've tested this on Skylake, comparing a call/ret in a loop with the function setting up a traditional frame pointer and tearing down its stack frame using `mov` / `pop` or `leave`. `perf` counters for `uops_issued.any` show one more front-end uop when you use `leave` than for `mov` / `pop`. (I ran my own test in case other measurement methods had been counting a stack-sync uop in their `leave` measurements, but using it in a real function controls for that.)

Possible reasons why older CPUs might have benefited more from keeping mov / pop split up:
- In most CPUs without a uop cache (i.e. Intel before Sandybridge, AMD before Zen), multi-uop instructions can be a decode bottleneck. They can only decode in the first ("complex") decoder, so the decode cycle before that might have produced fewer uops than normal.
- Some Windows calling conventions are callee-pops for stack args, using `ret n` (e.g. `ret 8` to do ESP/RSP += 8 after popping the return address). This is a multi-uop instruction, unlike plain near `ret` on modern x86. So the above reason goes double: `leave` and `ret 12` couldn't decode in the same cycle.
- Those reasons also apply to legacy decode to build uop-cache entries.
- P5 Pentium also preferred a RISC-like subset of x86, being unable to break up complex instructions into separate uops at all.
For modern CPUs, `leave` takes up 1 extra uop in the uop cache. And all 3 have to be in the same line of the uop cache, which could lead to only partial filling of the previous line. So larger x86 code size could actually improve packing into the uop cache. Or not, depending on how things line up.
Saving 2 bytes (or 3 in 64-bit mode) may or may not be worth 1 extra uop per function.
GCC favours `leave`; clang and MSVC favour `mov` / `pop` (even with `clang -Oz`, which optimizes for code size even at the expense of speed, e.g. doing stuff like `push 1 / pop rax` (3 bytes) instead of the 5-byte `mov eax,1`).
ICC favours mov/pop, but with `-Os` will use `leave`. https://godbolt.org/z/95EnP3G1f
There is no real speed advantage to using either of them, though the long method will probably run better, because CPUs these days are more 'optimized' for the shorter, simpler instructions that are more generic in use (plus it allows saturation of the execution ports if you're lucky).
The advantage of `LEAVE` (which is still used; just look at the Windows DLLs) is that it's smaller than manually tearing down a stack frame, which helps a lot when your space is limited.
The Intel instruction manuals (volume 2A to be precise) have more nitty-gritty details on the instructions, as do Dr. Agner Fog's optimization manuals.
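The size difference can be checked by writing the encodings out by hand. The sketch below stores the machine-code bytes of each instruction in C arrays and lets `sizeof` do the arithmetic. The array and function names are made up for illustration, and the prologue comparison assumes a frame size that fits in an 8-bit immediate (here 16 bytes); a larger frame needs a longer `sub` encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Epilogue encodings */
static const unsigned char leave_bytes[]       = { 0xC9 };             /* leave        */
static const unsigned char mov_esp_ebp_bytes[] = { 0x89, 0xEC };       /* mov esp, ebp */
static const unsigned char mov_rsp_rbp_bytes[] = { 0x48, 0x89, 0xEC }; /* mov rsp, rbp */
static const unsigned char pop_bp_bytes[]      = { 0x5D };             /* pop ebp/rbp  */

/* Prologue encodings (32-bit mode, 16 bytes of locals) */
static const unsigned char enter_16_0_bytes[]  = { 0xC8, 0x10, 0x00, 0x00 }; /* enter 16, 0  */
static const unsigned char push_ebp_bytes[]    = { 0x55 };                   /* push ebp     */
static const unsigned char mov_ebp_esp_bytes[] = { 0x89, 0xE5 };             /* mov ebp, esp */
static const unsigned char sub_esp_16_bytes[]  = { 0x83, 0xEC, 0x10 };       /* sub esp, 16  */

/* Bytes saved by leave over mov/pop in 32-bit mode. */
size_t leave_savings_32(void) {
    return sizeof mov_esp_ebp_bytes + sizeof pop_bp_bytes - sizeof leave_bytes;
}

/* Same in 64-bit mode, where the mov needs a REX.W prefix. */
size_t leave_savings_64(void) {
    return sizeof mov_rsp_rbp_bytes + sizeof pop_bp_bytes - sizeof leave_bytes;
}

/* Bytes saved by enter over push/mov/sub with an 8-bit immediate. */
size_t enter_savings_32(void) {
    return sizeof push_ebp_bytes + sizeof mov_ebp_esp_bytes +
           sizeof sub_esp_16_bytes - sizeof enter_16_0_bytes;
}
```

With these encodings, `leave` saves 2 bytes in 32-bit mode and 3 bytes in 64-bit mode, and `enter 16, 0` saves 2 bytes over its three-instruction equivalent. That is a small gain per function, to be weighed against the extra uops.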