How does x86 handle store-conditional instructions?

Posted 2024-08-10 02:36:16, 101 words, 9 views, 0 comments

I am trying to find out what an x86 processor does when it encounters a store-conditional (SC) instruction. For instance, does it stall the front end of the pipeline and wait for the ROB to drain before it resumes the front end and executes the SC? Basically, does it force the processor to become non-speculative?

Comments (3)

人间☆小暴躁 2024-08-17 02:36:16

I'm guessing that you're referring to the CMOVcc instructions.

I don't know about older x86 processors, but modern ones (ever since they became speculative and out of order) implement conditional stores as:

old value = mem[dest address]
if (condition) 
    mem[dest address] = new value
else
    mem[dest address] = old value

The condition part can be implemented in hardware like this:

      cond
    |\ |
----| \|
new |  \
    |   |    dest
    |   |---------
    |   |     |
  __|  /      |
 |  | /       |
 |  |/        |
 |____________|

So there's no need to break speculation. A store will in fact take place. The condition determines if the data to be written will be the old value or a new one.

烙印 2024-08-17 02:36:16

Unlike ARM and many other RISCs, x86 doesn't have load-linked / store-conditional; architecturally it has instructions like lock add byte [rdi], 1 or lock cmpxchg [rdi], ecx for atomic RMW. See "Is incrementing an int effectively atomic in specific cases?" for some details of the semantics and CPU architecture.

See also "x86 equivalent for LWARX and STWCX": arbitrary atomic RMW operations can be synthesized with a CAS (lock cmpxchg) retry loop. Unlike LL/SC, CAS is susceptible to ABA problems, but it is the other major way of providing a building block for atomic operations.
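The CAS retry loop mentioned above can be sketched in C11 as follows. This is a minimal illustration, not code from the answer: the function name atomic_fetch_max_int is hypothetical, and "fetch-max" is just an example of an RMW operation that x86 has no single locked instruction for. On x86, compilers typically lower atomic_compare_exchange_weak to lock cmpxchg.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: atomically replace *obj with max(*obj, candidate)
   and return the value that was there before. Built from a CAS retry loop. */
int atomic_fetch_max_int(_Atomic int *obj, int candidate)
{
    assert(obj != NULL);
    int old = atomic_load_explicit(obj, memory_order_relaxed);
    /* If the CAS fails, it writes the value it actually found into `old`,
       so every retry re-checks the condition against fresh data. */
    while (old < candidate &&
           !atomic_compare_exchange_weak(obj, &old, candidate)) {
        /* another thread changed *obj between our load and the CAS; retry */
    }
    return old;
}
```

Note that the loop only compares values: if another thread changes the location from A to B and back to A between the load and the CAS, the CAS still succeeds, which is the ABA exposure that LL/SC avoids.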


Internally on modern x86 CPUs, an atomic RMW probably works by running a load uop that also "locks" that cache line. (Instead of arming a monitor so a later SC will fail, the "cache lock" prevents MESI responses until a store-unlock, preventing the things that would have made an SC fail on an LL/SC machine.)

Taking a cache lock on just that line in MESI Modified state (instead of the traditional bus lock) depends on it being cacheable memory, and being aligned or at least not splitting across a cache-line boundary.


x86's cmov instruction only has one form, cmovcc reg, reg/mem: the destination is always a register, never memory. Even with a memory source, it's an unconditional load that feeds an ALU select operation, so it will segfault on a bad address even if the condition is false. (Unlike ARM predicated instructions, where the whole instruction is NOPed out on a false condition.)
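The difference between an ALU select and a predicated load can be modelled in plain C. This is only a sketch of the semantics described above; both function names are hypothetical, and a compiler is free to emit something other than cmov for either function.

```c
#include <assert.h>
#include <stddef.h>

/* Models `cmovcc reg, [mem]`: the load always happens, then a select picks
   between the loaded value and the old register value. */
int select_after_load(int cond, int reg, const int *src)
{
    assert(src != NULL);       /* a bad pointer would fault even when cond is 0 */
    int loaded = *src;         /* unconditional load */
    return cond ? loaded : reg;
}

/* Models a predicated (32-bit ARM style) load: memory is only touched
   when the condition is true. */
int predicated_load(int cond, int reg, const int *src)
{
    return cond ? *src : reg;  /* src may be NULL when cond is 0 */
}
```

With a false condition, predicated_load(0, r, NULL) simply returns r, while the cmov-style model would have to dereference the pointer first.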

I guess you could say lock cmpxchg [mem], reg is a conditional store, but the only condition possible is whether the old contents of memory match AL/AX/EAX/RAX. https://www.felixcloutier.com/x86/cmpxchg
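Viewed from C11, that "store only if the old contents match" behaviour looks roughly like the sketch below. The wrapper name store_if_equal is hypothetical; the *expected argument plays the role of the accumulator register, and on failure it receives the value that was actually found in memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wrapper: store `desired` into *obj only if *obj currently
   equals *expected. Returns true if the store took place. On failure,
   *expected is overwritten with the value found in memory. */
bool store_if_equal(_Atomic int *obj, int *expected, int desired)
{
    assert(obj != NULL && expected != NULL);
    return atomic_compare_exchange_strong(obj, expected, desired);
}
```

The only condition available is equality with the expected value; an arbitrary FLAGS condition cannot be expressed this way.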

rep stosb/w/d/q is also a conditional store, if you arrange for RCX to be 0 or 1 (e.g. xor ecx,ecx / set FLAGS / setcc cl); microcode branching isn't branch-predicted so it's a bit different from normal branching.

AVX vmaskmovps or AVX-512 masked stores are truly conditional stores, based on a mask condition. My answer on another Q&A about cmov discusses the conditional-load equivalents of these, along with the fact that cmov is not a conditional load, it's an ALU select that needs all 3 inputs (FLAGS and 2 integers).
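As a rough scalar model of what a masked store does (not the intrinsic itself, and not a statement about how hardware implements it), the per-element behaviour can be written as below. The function name and the bitmask layout are assumptions for illustration: the bitmask resembles an AVX-512 k register, whereas vmaskmovps takes its mask from the sign bit of each element of a vector register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a masked vector store with up to 8 lanes: element i is
   written only when bit i of `mask` is set. Masked-off elements are never
   touched, which is why the real instructions can suppress faults on them. */
void masked_store_model(float *dst, const float *src, uint8_t mask, size_t lanes)
{
    assert(lanes <= 8);
    for (size_t i = 0; i < lanes; i++) {
        if (mask & (1u << i)) {
            dst[i] = src[i];
        }
    }
}
```

For example, a mask of 0x5 over four lanes writes elements 0 and 2 and leaves elements 1 and 3 as they were.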

Conditional stores are rare in most ISAs other than the SC part of an LL/SC pair. 32-bit ARM is an exception to the rule; see "Why are conditionally executed instructions not present in later ARM instruction sets?" for why AArch64 dropped it.


AVX and AVX-512 masked stores do not stall the pipeline. See https://agner.org/optimize/ and https://uops.info/ for some performance numbers, plus Intel's optimization manual. They suppress faults on masked elements. If you reload the data before a masked store commits to L1d, store-forwarding from it might stall that load, but not the whole pipeline.


Intel APX (Advanced Performance Extensions) adds REX2 and EVEX prefixes for legacy integer instructions like sub. It also adds some new encodings of cmov that actually do suppress faults on a load with a false condition, as well as a conditional-store version. These use the mnemonic CFCMOVcc, Conditionally Faulting CMOV. Intel finally decided to make an extension that requires 64-bit mode, using some of the coding space freed up by removing BCD and other opcodes.

Presumably the hardware handles these conditional loads and stores similarly to AVX-512 masking.

勿忘心安 2024-08-17 02:36:16

A (generic) x86 processor does none of the things you mentioned. It just fetches one instruction after another and executes them.

Everything else is handled transparently and heavily depends on which processor you are looking at, so there is no generic answer to your question.

If you are interested in techniques for avoiding stalls, you should start at the Wikipedia page on x86 (register renaming, to mention one; results from the non-taken branch are simply thrown away).
