Which is the better write barrier on x86: lock+addl or xchgl?
The Linux kernel uses lock; addl $0,0(%%esp) as a write barrier, while the RE2 library uses xchgl (%0),%0 as its write barrier. What is the difference, and which one is better?
Does x86 also need read barrier instructions? RE2 defines its read barrier function as a no-op on x86, whereas Linux defines it as either lfence or a no-op, depending on whether SSE2 is available. When is lfence needed?
5 Answers
The relevant rules are spelled out in the IA32 manuals (Vol 3A, Chapter 8.2: Memory Ordering).

Note: the manual's phrase "In a single-processor system" is slightly misleading. The same rules hold for each (logical) processor individually; the manual then goes on to describe the additional ordering rules between multiple processors, only one of which pertains to this question.
In short, as long as you're writing to write-back memory (which is all the memory you'll ever see as long as you're not a driver or graphics programmer), most x86 instructions are almost sequentially consistent - the only reordering an x86 CPU can perform is to reorder later (independent) reads so that they execute before earlier writes. The main thing about the write barriers is that they have a lock prefix (implicit or explicit), which forbids all reordering and ensures that the operations are seen in the same order by all processors in a multi-processor system.

Also, in write-back memory, reads are never reordered, so there's no need for read barriers. Recent x86 processors have a weaker memory consistency model for streaming stores and write-combined memory (commonly used for mapped graphics memory). That's where the various fence instructions come into play; they're not necessary for any other memory type, but some drivers in the Linux kernel do deal with write-combined memory, so they just defined their read barrier that way.

The list of ordering models per memory type is in Section 11.3.1 of Vol. 3A of the IA-32 manuals. Short version: Write-Through, Write-Back and Write-Protected memory allow speculative reads (following the rules detailed above); Uncacheable and Strong Uncacheable memory have strong ordering guarantees (no processor reordering, reads/writes execute immediately, used for MMIO); and Write Combining memory has weak ordering (i.e. relaxed ordering rules that need fences).
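For reference, the two barriers the question asks about could be written in GCC-style inline asm roughly as below. This is only a sketch (the helper names are mine, not the actual Linux or RE2 definitions), meant to show that both are lock'ed operations and therefore full barriers on ordinary write-back memory:

```cpp
// Sketch only: illustrative helpers, not the actual Linux/RE2 code.

// Linux-kernel style: explicit lock prefix, add 0 to the word at the top of
// the stack. (Written for 32-bit code as in the question; 64-bit code would
// use %%rsp instead of %%esp.)
static inline void barrier_lock_add(void) {
    __asm__ __volatile__("lock; addl $0,0(%%esp)" ::: "memory", "cc");
}

// RE2 style: xchg with a memory operand, where the lock prefix is implicit.
static inline void barrier_xchg(void) {
    int mem = 0;   // dummy memory location on the stack
    int reg = 0;   // dummy register value
    __asm__ __volatile__("xchgl %0,%1" : "+r"(reg), "+m"(mem) : : "memory");
}
```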
The "lock; addl $0,0(%%esp)" is faster in case that we testing the 0 state of lock variable at (%%esp) address. Because we add 0 value to lock variable and the zero flag is set to 1 if the lock value of variable at address (%%esp) is 0.
lfence from Intel datasheet:
(Editor's note:
mfence
or alock
ed operation is the only useful fence (after a store) for sequential consistency.lfence
does not block StoreLoad reordering by the store buffer.)For instance: memory write instruction like 'mov' are atomic (they don't need lock prefix) if they are properly aligned. But this instruction is normally executed in CPU cache and will not be globally visible at this moment for all other threads, because memory fence must be performed first to make this thread wait until previous stores are visible to other threads.
So the main difference between these two instructions is that xchgl instruction will not have any effect on the conditional flags. Certainly we can test the lock variable state with lock cmpxchg instruction but this is still more complex than with lock add $0 instruction.
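To make the flags point concrete, here is a hedged sketch (the function name is mine) that uses GCC's flag-output operands to read ZF straight out of the lock'ed add - something xchgl cannot provide, since it leaves the flags untouched:

```cpp
// Sketch: a full barrier that also reports whether *lock_word was zero,
// using the ZF set by the lock'ed add of 0 (GCC 6+ flag-output operands).
static inline bool barrier_and_test_zero(int *lock_word) {
    bool was_zero;
    __asm__ __volatile__("lock; addl $0,%[mem]"
                         : [mem] "+m"(*lock_word), "=@ccz"(was_zero)
                         :
                         : "memory");
    return was_zero;
}
```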
lock addl $0, (%esp) is a substitute for mfence, not lfence.

(lock add is generally faster on modern CPUs, especially Intel Skylake with updated microcode where mfence acts like lfence as well, blocking out-of-order exec even of instructions on registers. That's why GCC recently switched to using a dummy lock add instead of mfence when it needs a full barrier. Also related: On The Fence With Dependencies, Aleksey Shipilëv 2014, has some microbenchmarks of using -8(%rsp) vs. 0(%rsp) as the destination for a dummy op.)

The use-case is when you need to block StoreLoad reordering (the only kind that x86's strong memory model allows), but you don't need an atomic RMW operation on a shared variable. https://preshing.com/20120515/memory-reordering-caught-in-the-act/
e.g. assuming aligned std::atomic<int> a,b, where the default memory_order is seq_cst. Your options are:

- Do a sequential-consistency store with xchg, e.g. mov $1, %eax / xchg %eax, a, so you don't need a separate barrier; it's part of the store. I think this is the most efficient option on most modern hardware; C++11 compilers other than gcc use xchg for seq_cst stores. (See Why does a std::atomic store with sequential consistency use XCHG? re: performance and correctness.)

- Use mfence as a barrier. (gcc used mov + mfence for seq_cst stores, but recently switched to xchg for performance.)

- Use lock addl $0, (%esp) as a barrier. Any lock'ed instruction is a full barrier, but this one has no effect on register or memory contents except FLAGS. See Does lock xchg have the same behavior as mfence? (Or use some other location, but the stack is almost always private and hot in L1d, so it's a good candidate. Later reloads of whatever was using that space couldn't start until after the atomic RMW anyway, because it's a full barrier.)
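A minimal sketch of what those three options look like, assuming a plain aligned int as a stand-in for the atomic variable (compilers emit these sequences themselves for std::atomic; the function names are illustrative only):

```cpp
// Plain, aligned int standing in for std::atomic<int> a; the ordering is
// done explicitly by the asm below.
alignas(4) int a = 0;

// Option 1: xchg folds the StoreLoad barrier into the store itself
// (the lock prefix is implicit for xchg with a memory operand).
void store_seq_cst_xchg() {
    int v = 1;
    __asm__ __volatile__("xchgl %0,%1" : "+r"(v), "+m"(a) : : "memory");
}

// Option 2: plain store, then mfence as a separate full barrier.
void store_seq_cst_mfence() {
    __asm__ __volatile__("movl $1,%0\n\tmfence" : "+m"(a) : : "memory");
}

// Option 3: plain store, then a dummy lock'ed add to the stack as the barrier
// (use 0(%esp) in 32-bit code).
void store_seq_cst_lock_add() {
    __asm__ __volatile__("movl $1,%0\n\t"
                         "lock; addl $0,(%%rsp)"
                         : "+m"(a) : : "memory", "cc");
}
```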
You can only use xchg as a barrier by folding it into a store, because it unconditionally writes the memory location with a value that doesn't depend on the old value.

When possible, using xchg for a seq-cst store is probably best, even though it also reads from the shared location. mfence is slower than expected on recent Intel CPUs (Are loads and stores the only instructions that gets reordered?), also blocking out-of-order execution of independent non-memory instructions the same way lfence does.

It might even be worth using lock addl $0, (%esp)/(%rsp) instead of mfence even when mfence is available, but I haven't experimented with the downsides. Using -64(%rsp) or something might make it less likely to lengthen a data dependency on something hot (a local or a return address), but that can make tools like valgrind unhappy.

lfence is never useful for memory ordering unless you're reading from video RAM (or some other WC weakly-ordered region) with MOVNTDQA loads. Serializing out-of-order execution (but not the store buffer) isn't useful to stop StoreLoad reordering (the only kind that x86's strong memory model allows for normal WB (write-back) memory regions).
The real-world use-cases for lfence are for blocking out-of-order execution of rdtsc for timing very short blocks of code, or for Spectre mitigation by blocking speculation through a conditional or indirect branch.

See also When should I use _mm_sfence, _mm_lfence and _mm_mfence (my answer and @BeeOnRope's answer) for more about why lfence is not useful, and when to use each of the barrier instructions. (Or in mine, the C++ intrinsics when programming in C++ instead of asm.)
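The rdtsc use-case mentioned above might look roughly like this; a sketch only, since serious measurement also needs warm-up, repetition and attention to frequency scaling:

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc, _mm_lfence

// Sketch: lfence keeps rdtsc from executing out of order with the timed work.
static inline uint64_t rdtsc_fenced() {
    _mm_lfence();            // wait for earlier instructions to complete locally
    uint64_t t = __rdtsc();
    _mm_lfence();            // keep later instructions from starting before rdtsc finishes
    return t;
}

int main() {
    uint64_t t0 = rdtsc_fenced();
    volatile int sink = 0;
    for (int i = 0; i < 1000; ++i) sink += i;   // tiny block of code to time
    uint64_t t1 = rdtsc_fenced();
    std::printf("~%llu reference cycles\n", (unsigned long long)(t1 - t0));
}
```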
As an aside to the other answers, the HotSpot devs found that lock; addl $0,0(%%esp) with a zero offset may not be optimal: on some processors it can introduce false data dependencies (see the related jdk bug). Touching a stack location with a different offset can improve performance under some circumstances.
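A sketch of that variant (the -64 offset is only illustrative; the jdk bug discussion and the Shipilëv microbenchmarks mentioned earlier are where the actual measurements live):

```cpp
// Sketch: the same dummy lock'ed add, but at an offset below the stack
// pointer, so it is less likely to touch data (locals, return address) that
// later loads depend on. The value at that location is unchanged (we add 0).
static inline void full_barrier_offset_stack() {
    __asm__ __volatile__("lock; addl $0,-64(%%rsp)" ::: "memory", "cc");
}
```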
The important part of lock; addl and xchgl is the lock prefix. It's implicit for xchgl. There is really no difference between the two. I'd look at how they assemble and choose the one that's shorter (in bytes), since that's usually faster for equivalent operations on x86 (hence tricks like xorl eax,eax).

The presence of SSE2 is probably just a proxy for the real condition, which is ultimately a function of cpuid. It probably turns out that SSE2 implies the existence of lfence, and the availability of SSE2 was checked/cached at boot. lfence is required when it's available.
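A sketch of the kind of check being described (the helper name is mine, and a runtime check like this is only for illustration): the SSE2 CPUID feature bit is what guarantees that lfence (and mfence) exist:

```cpp
#include <cpuid.h>    // GCC/Clang helper for the CPUID instruction
#include <cstdio>

// Sketch: CPUID leaf 1, EDX bit 26 is the SSE2 feature flag.
// SSE2 implies lfence/mfence exist, so this is the "proxy" check.
static bool cpu_has_sse2() {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (edx & (1u << 26)) != 0;
}

int main() {
    std::printf("SSE2 (and therefore lfence): %s\n",
                cpu_has_sse2() ? "available" : "not available");
}
```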