"enter" vs "push ebp; mov ebp, esp; sub esp, imm" and "leave" vs "mov esp, ebp; pop ebp"

Posted 2024-11-06 08:27:38

What is the difference between the enter instruction and the

push ebp
mov  ebp, esp
sub  esp, imm

sequence? Is there a performance difference? If so, which is faster, and why do compilers always use the latter?

Similarly for leave versus the

mov  esp, ebp
pop  ebp

sequence.

挽梦忆笙歌 2024-11-13 08:27:38

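There is a performance difference, especially for enter. On modern processors it decodes to roughly 10 to 20 µops, while the three-instruction sequence is about 4 to 6 µops, depending on the architecture. For details, see Agner Fog's instruction tables.

In addition, the enter instruction usually has quite a high latency, for example 8 clocks on a Core 2, compared with the 3-clock dependency chain of the three-instruction sequence.

Furthermore, the compiler can spread the three-instruction sequence out for scheduling purposes (depending on the surrounding code, of course) to allow more instructions to execute in parallel.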

莳間冲淡了誓言ζ 2024-11-13 08:27:38

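When designing the 80286, Intel's CPU designers decided to add two instructions, ENTER and LEAVE, to help maintain displays (arrays of frame pointers that give nested procedures access to the variables of their enclosing procedures).

Here is what ENTER does inside the CPU, written out as the equivalent instruction sequence: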

; ENTER Locals, LexLevel

push    bp              ;Save dynamic link.
mov     tempreg, sp     ;Save for later.
cmp     LexLevel, 0     ;Done if this is lex level zero.
je      Lex0

lp:
dec     LexLevel
jz      Done            ;Quit if at last lex level.
sub     bp, 2           ;Index into display in prev act rec
push    [bp]            ; and push each element there.
jmp     lp              ;Repeat for each entry.

Done:
push    tempreg         ;Add entry for current lex level.

Lex0:
mov     bp, tempreg     ;Ptr to current act rec.
sub     sp, Locals      ;Allocate local storage
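The pseudocode above can be traced with a small C model. This is only a sketch under stated assumptions: Machine16, push16, load16, enter_sim, prologue_sim and enter0_matches_prologue are hypothetical names invented for illustration, the model covers just SP, BP and a flat 64 KiB memory, and it ignores segmentation, faults and the way real hardware masks the nesting level.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 16-bit stack machine: 64 KiB of memory, SP and BP. */
typedef struct {
    uint8_t  mem[65536];
    uint16_t sp, bp;
} Machine16;

/* push: decrement SP by 2, then store a little-endian word at [SP]. */
void push16(Machine16 *m, uint16_t v) {
    assert(m->sp >= 2);                      /* guard against wrapping the stack */
    m->sp = (uint16_t)(m->sp - 2);
    m->mem[m->sp]     = (uint8_t)(v & 0xFF);
    m->mem[m->sp + 1] = (uint8_t)(v >> 8);
}

/* Read a little-endian word from memory. */
uint16_t load16(const Machine16 *m, uint16_t addr) {
    assert(addr < 65535);
    return (uint16_t)(m->mem[addr] | (m->mem[addr + 1] << 8));
}

/* ENTER locals, level: a line-by-line transcription of the pseudocode. */
void enter_sim(Machine16 *m, uint16_t locals, uint8_t level) {
    push16(m, m->bp);                        /* push bp: save dynamic link   */
    uint16_t tempreg = m->sp;                /* mov tempreg, sp              */
    if (level != 0) {                        /* cmp LexLevel, 0 / je Lex0    */
        while (--level != 0) {               /* dec LexLevel / jz Done       */
            m->bp = (uint16_t)(m->bp - 2);   /* sub bp, 2                    */
            push16(m, load16(m, m->bp));     /* push [bp]                    */
        }
        push16(m, tempreg);                  /* Done: push tempreg           */
    }
    m->bp = tempreg;                         /* Lex0: mov bp, tempreg        */
    m->sp = (uint16_t)(m->sp - locals);      /* sub sp, Locals               */
}

/* The explicit prologue: push bp / mov bp, sp / sub sp, locals. */
void prologue_sim(Machine16 *m, uint16_t locals) {
    push16(m, m->bp);
    m->bp = m->sp;
    m->sp = (uint16_t)(m->sp - locals);
}

/* Returns 1 when ENTER locals, 0 and the explicit prologue leave the same
   SP, the same BP and the same saved BP on the stack. */
int enter0_matches_prologue(uint16_t locals) {
    static Machine16 a, b;
    memset(&a, 0, sizeof a);
    memset(&b, 0, sizeof b);
    a.sp = b.sp = 0x1000;                    /* arbitrary starting values */
    a.bp = b.bp = 0x2000;
    enter_sim(&a, locals, 0);
    prologue_sim(&b, locals);
    return a.sp == b.sp && a.bp == b.bp &&
           load16(&a, a.bp) == load16(&b, b.bp);
}
```

Calling enter0_matches_prologue with any local size that fits on the modelled stack should return 1, which is the lexical-level-0 case the question asks about. For higher levels the loop shows where the extra work comes from: one memory read and one push per enclosing level.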

An alternative to ENTER would be:

; enter n, 0 ;14 cycles on the 486

push    bp              ;1 cycle on the 486
mov     bp, sp          ;1 cycle on the 486 (needed to set up the frame pointer, as ENTER does)
sub     sp, n           ;1 cycle on the 486

; enter n, 1 ;17 cycles on the 486

push    bp              ;1 cycle on the 486
push    [bp-2]          ;4 cycles on the 486
mov     bp, sp          ;1 cycle on the 486
add     bp, 2           ;1 cycle on the 486
sub     sp, n           ;1 cycle on the 486

; enter n, 3 ;23 cycles on the 486

push    bp              ;1 cycle on the 486
push    [bp-2]          ;4 cycles on the 486
push    [bp-4]          ;4 cycles on the 486
push    [bp-6]          ;4 cycles on the 486
mov     bp, sp          ;1 cycle on the 486
add     bp, 6           ;1 cycle on the 486
sub     sp, n           ;1 cycle on the 486

And so on. The long way might increase your file size, but it is much quicker.

One last note: programmers don't really use displays anymore, since they were a very slow workaround, which makes ENTER pretty useless now.

Source: https://courses.engr.illinois.edu/ece390/books/artofasm/CH12/CH12-3.html

栖迟 2024-11-13 08:27:38

enter is unusably slow on all CPUs; nobody uses it except maybe for code-size optimization at the expense of speed. (And then only if a frame pointer is needed at all, or is wanted to allow more compact addressing modes for stack space.)

leave is fast enough to be worth using, and GCC does use it (if ESP / RSP isn't already pointing at a saved EBP/RBP; otherwise it just uses pop ebp).
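The condition in that parenthesis can be illustrated with a small C model of the stack. It is a sketch only: Machine32, push32, pop32, build_frame, leave_sim, pop_only_sim and the two check functions are hypothetical names invented here, and the model tracks nothing but ESP, EBP and a small flat memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: a 4 KiB stack area plus ESP and EBP. */
typedef struct {
    uint8_t  mem[4096];
    uint32_t esp, ebp;
} Machine32;

void push32(Machine32 *m, uint32_t v) {
    assert(m->esp >= 4 && m->esp <= sizeof m->mem);
    m->esp -= 4;
    memcpy(&m->mem[m->esp], &v, 4);
}

uint32_t pop32(Machine32 *m) {
    uint32_t v;
    assert(m->esp + 4 <= sizeof m->mem);
    memcpy(&v, &m->mem[m->esp], 4);
    m->esp += 4;
    return v;
}

/* Prologue: push ebp / mov ebp, esp / sub esp, locals. */
void build_frame(Machine32 *m, uint32_t locals) {
    push32(m, m->ebp);
    m->ebp = m->esp;
    m->esp -= locals;
}

/* leave: ESP = EBP, then EBP = pop(). Same effect as mov esp, ebp / pop ebp. */
void leave_sim(Machine32 *m) {
    m->esp = m->ebp;
    m->ebp = pop32(m);
}

/* pop ebp on its own, with no mov esp, ebp in front of it. */
void pop_only_sim(Machine32 *m) {
    m->ebp = pop32(m);
}

/* Returns 1 when leave restores the caller's ESP and EBP. */
int leave_restores_caller(uint32_t locals) {
    Machine32 m;
    memset(&m, 0, sizeof m);
    m.esp = 0x800;                           /* arbitrary caller state */
    m.ebp = 0x900;
    build_frame(&m, locals);
    leave_sim(&m);
    return m.esp == 0x800 && m.ebp == 0x900;
}

/* Returns 1 when a bare pop ebp restores the caller's ESP and EBP. */
int pop_only_restores_caller(uint32_t locals) {
    Machine32 m;
    memset(&m, 0, sizeof m);
    m.esp = 0x800;
    m.ebp = 0x900;
    build_frame(&m, locals);
    pop_only_sim(&m);
    return m.esp == 0x800 && m.ebp == 0x900;
}
```

In this model leave restores the caller for any frame size, while a bare pop ebp only does so when ESP already points at the saved EBP, for example when no stack space was reserved for locals.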

leave is only 3 uops on modern Intel CPUs (and 2 on some AMD). (https://agner.org/optimize/, https://uops.info/).

mov / pop is only 2 uops total (on modern x86, where a "stack engine" tracks updates to ESP/RSP). So leave is just one more uop than doing things separately. I've tested this on Skylake, comparing a call/ret in a loop where the function sets up a traditional frame pointer and tears down its stack frame using either mov/pop or leave. The perf counter for uops_issued.any shows one more front-end uop when you use leave than with mov/pop. (I ran my own test in case other measurement methods had been counting a stack-sync uop in their leave measurements, but using it in a real function controls for that.)

Possible reasons why older CPUs might have benefited more from keeping mov / pop split up:

In most CPUs without a uop cache (i.e. Intel before Sandybridge, AMD before Zen), multi-uop instructions can be a decode bottleneck. They can only decode in the first ("complex") decoder, which might mean the decode cycle before that produced fewer uops than normal.
Some Windows calling conventions have the callee pop stack args, using ret n (e.g. ret 8 to do ESP/RSP += 8 after popping the return address). This is a multi-uop instruction, unlike a plain near ret on modern x86. So the above reason goes double: leave and ret 12 couldn't decode in the same cycle.
Those reasons also apply to legacy decode when building uop-cache entries.
The P5 Pentium also preferred a RISC-like subset of x86, being unable to break up complex instructions into separate uops at all.

For modern CPUs, leave takes up 1 extra uop in the uop cache. And all 3 have to be in the same line of the uop cache, which could lead to only partial filling of the previous line. So larger x86 code size could actually improve packing into the uop cache. Or not, depending on how things line up.

Saving 2 bytes (or 3 in 64-bit mode) may or may not be worth 1 extra uop per function.
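The byte counts can be checked by writing out the machine-code encodings. The sketch below uses the standard one-byte opcodes for leave (C9) and pop ebp/rbp (5D), the 89 /r form of mov, and C8 iw ib for enter. The array and function names are made up for illustration, and an assembler may pick the equivalent 8B /r encoding for the mov, which has the same length.

```c
#include <stddef.h>

/* Epilogue encodings. */
const unsigned char LEAVE_ENC[]      = { 0xC9 };                    /* leave                 */
const unsigned char MOV_POP_32_ENC[] = { 0x89, 0xEC, 0x5D };        /* mov esp,ebp / pop ebp */
const unsigned char MOV_POP_64_ENC[] = { 0x48, 0x89, 0xEC, 0x5D };  /* mov rsp,rbp / pop rbp */

/* Prologue encodings in 32-bit mode, assuming 8 bytes of locals. */
const unsigned char ENTER_8_0_ENC[]   = { 0xC8, 0x08, 0x00, 0x00 }; /* enter 8, 0            */
const unsigned char PROLOGUE_32_ENC[] = { 0x55,                     /* push ebp              */
                                          0x89, 0xE5,               /* mov ebp, esp          */
                                          0x83, 0xEC, 0x08 };       /* sub esp, 8            */

/* Bytes saved by leave over mov/pop in 32-bit mode. */
size_t leave_saving_32(void) { return sizeof MOV_POP_32_ENC - sizeof LEAVE_ENC; }

/* Bytes saved by leave over mov/pop in 64-bit mode (the mov needs a REX.W prefix). */
size_t leave_saving_64(void) { return sizeof MOV_POP_64_ENC - sizeof LEAVE_ENC; }

/* Bytes saved by enter 8, 0 over the three-instruction prologue. */
size_t enter_saving_32(void) { return sizeof PROLOGUE_32_ENC - sizeof ENTER_8_0_ENC; }
```

leave_saving_32() and leave_saving_64() give the 2 and 3 bytes mentioned above. For comparison, enter_saving_32() shows that enter saves 2 bytes over the explicit prologue for a small frame, which is the only thing it has going for it.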

GCC favours leave; clang and MSVC favour mov/pop (clang does so even with -Oz, which optimizes for code size at the expense of speed, e.g. doing stuff like push 1 / pop rax (3 bytes) instead of the 5-byte mov eax,1).

ICC favours mov/pop, but with -Os will use leave. https://godbolt.org/z/95EnP3G1f
