Which is the better write barrier on x86: lock+addl or xchgl?
The Linux kernel uses lock; addl $0,0(%%esp) as a write barrier, while the RE2 library uses xchgl (%0),%0 as its write barrier. What is the difference, and which one is better?
Does x86 also need read barrier instructions? RE2 defines its read barrier function as a no-op on x86, whereas Linux defines it as either lfence or a no-op, depending on whether SSE2 is available. When is lfence needed?
5 Answers
The relevant rules are spelled out in the IA32 manuals (Vol 3A, Chapter 8.2: Memory Ordering).

Note: the manual's phrase "In a single-processor system" is slightly misleading. The same rules hold for each (logical) processor individually; the manual then goes on to describe the additional ordering rules between multiple processors, only one of which pertains to this question.
In short, as long as you're writing to write-back memory (which is all the memory you'll ever see as long as you're not a driver or graphics programmer), most x86 instructions are almost sequentially consistent - the only reordering an x86 CPU can perform is to reorder later (independent) reads so that they execute before earlier writes. The main thing about the write barriers is that they have a lock prefix (implicit or explicit), which forbids all reordering and ensures that the operations are seen in the same order by all processors in a multi-processor system.

Also, in write-back memory, reads are never reordered, so there's no need for read barriers. Recent x86 processors have a weaker memory consistency model for streaming stores and write-combined memory (commonly used for mapped graphics memory). That's where the various fence instructions come into play; they're not necessary for any other memory type, but some drivers in the Linux kernel do deal with write-combined memory, so they just defined their read barrier that way.

The list of ordering models per memory type is in Section 11.3.1 of Vol. 3A of the IA-32 manuals. Short version: Write-Through, Write-Back and Write-Protected memory allow speculative reads (following the rules detailed above); Uncacheable and Strong Uncacheable memory have strong ordering guarantees (no processor reordering, reads/writes execute immediately, used for MMIO); and Write Combining memory has weak ordering (i.e. relaxed ordering rules that need fences).
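For reference, the two barriers the question asks about could be written in GCC-style inline asm roughly as below. This is only a sketch (the helper names are mine, not the actual Linux or RE2 definitions), meant to show that both are lock'ed operations and therefore full barriers on ordinary write-back memory:

```cpp
// Sketch only: illustrative helpers, not the actual Linux/RE2 code.

// Linux-kernel style: explicit lock prefix, add 0 to the word at the top of
// the stack. (Written for 32-bit code as in the question; 64-bit code would
// use %%rsp instead of %%esp.)
static inline void barrier_lock_add(void) {
    __asm__ __volatile__("lock; addl $0,0(%%esp)" ::: "memory", "cc");
}

// RE2 style: xchg with a memory operand, where the lock prefix is implicit.
static inline void barrier_xchg(void) {
    int mem = 0;   // dummy memory location on the stack
    int reg = 0;   // dummy register value
    __asm__ __volatile__("xchgl %0,%1" : "+r"(reg), "+m"(mem) : : "memory");
}
```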
The "lock; addl $0,0(%%esp)" is faster in case that we testing the 0 state of lock variable at (%%esp) address. Because we add 0 value to lock variable and the zero flag is set to 1 if the lock value of variable at address (%%esp) is 0.
lfence from Intel datasheet:
(Editor's note:
mfence
or alock
ed operation is the only useful fence (after a store) for sequential consistency.lfence
does not block StoreLoad reordering by the store buffer.)For instance: memory write instruction like 'mov' are atomic (they don't need lock prefix) if they are properly aligned. But this instruction is normally executed in CPU cache and will not be globally visible at this moment for all other threads, because memory fence must be performed first to make this thread wait until previous stores are visible to other threads.
So the main difference between these two instructions is that xchgl instruction will not have any effect on the conditional flags. Certainly we can test the lock variable state with lock cmpxchg instruction but this is still more complex than with lock add $0 instruction.
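To make the flags point concrete, here is a hedged sketch (the function name is mine) that uses GCC's flag-output operands to read ZF straight out of the lock'ed add - something xchgl cannot provide, since it leaves the flags untouched:

```cpp
// Sketch: a full barrier that also reports whether *lock_word was zero,
// using the ZF set by the lock'ed add of 0 (GCC 6+ flag-output operands).
static inline bool barrier_and_test_zero(int *lock_word) {
    bool was_zero;
    __asm__ __volatile__("lock; addl $0,%[mem]"
                         : [mem] "+m"(*lock_word), "=@ccz"(was_zero)
                         :
                         : "memory");
    return was_zero;
}
```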
lock addl $0, (%esp) is a substitute for mfence, not lfence.

(lock add is generally faster on modern CPUs, especially Intel Skylake with updated microcode where mfence acts like lfence as well, blocking out-of-order exec even of instructions on registers. That's why GCC recently switched to using a dummy lock add instead of mfence when it needs a full barrier. Also related: On The Fence With Dependencies, Aleksey Shipilëv 2014, has some microbenchmarks of using -8(%rsp) vs. 0(%rsp) as the destination for a dummy op.)

The use-case is when you need to block StoreLoad reordering (the only kind that x86's strong memory model allows), but you don't need an atomic RMW operation on a shared variable. https://preshing.com/20120515/memory-reordering-caught-in-the-act/
e.g. assuming aligned std::atomic<int> a,b, where the default memory_order is seq_cst. Your options are:

- Do a sequential-consistency store with xchg, e.g. mov $1, %eax / xchg %eax, a, so you don't need a separate barrier; it's part of the store. I think this is the most efficient option on most modern hardware; C++11 compilers other than gcc use xchg for seq_cst stores. (See Why does a std::atomic store with sequential consistency use XCHG? re: performance and correctness.)

- Use mfence as a barrier. (gcc used mov + mfence for seq_cst stores, but recently switched to xchg for performance.)

- Use lock addl $0, (%esp) as a barrier. Any lock'ed instruction is a full barrier, but this one has no effect on register or memory contents except FLAGS. See Does lock xchg have the same behavior as mfence? (Or use some other location, but the stack is almost always private and hot in L1d, so it's a good candidate. Later reloads of whatever was using that space couldn't start until after the atomic RMW anyway, because it's a full barrier.)
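A minimal sketch of what those three options look like, assuming a plain aligned int as a stand-in for the atomic variable (compilers emit these sequences themselves for std::atomic; the function names are illustrative only):

```cpp
// Plain, aligned int standing in for std::atomic<int> a; the ordering is
// done explicitly by the asm below.
alignas(4) int a = 0;

// Option 1: xchg folds the StoreLoad barrier into the store itself
// (the lock prefix is implicit for xchg with a memory operand).
void store_seq_cst_xchg() {
    int v = 1;
    __asm__ __volatile__("xchgl %0,%1" : "+r"(v), "+m"(a) : : "memory");
}

// Option 2: plain store, then mfence as a separate full barrier.
void store_seq_cst_mfence() {
    __asm__ __volatile__("movl $1,%0\n\tmfence" : "+m"(a) : : "memory");
}

// Option 3: plain store, then a dummy lock'ed add to the stack as the barrier
// (use 0(%esp) in 32-bit code).
void store_seq_cst_lock_add() {
    __asm__ __volatile__("movl $1,%0\n\t"
                         "lock; addl $0,(%%rsp)"
                         : "+m"(a) : : "memory", "cc");
}
```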
You can only use xchg as a barrier by folding it into a store, because it unconditionally writes the memory location with a value that doesn't depend on the old value.

When possible, using xchg for a seq-cst store is probably best, even though it also reads from the shared location. mfence is slower than expected on recent Intel CPUs (Are loads and stores the only instructions that gets reordered?), also blocking out-of-order execution of independent non-memory instructions the same way lfence does.

It might even be worth using lock addl $0, (%esp)/(%rsp) instead of mfence even when mfence is available, but I haven't experimented with the downsides. Using -64(%rsp) or something might make it less likely to lengthen a data dependency on something hot (a local or a return address), but that can make tools like valgrind unhappy.

lfence is never useful for memory ordering unless you're reading from video RAM (or some other WC weakly-ordered region) with MOVNTDQA loads. Serializing out-of-order execution (but not the store buffer) isn't useful to stop StoreLoad reordering (the only kind that x86's strong memory model allows for normal WB (write-back) memory regions).
The real-world use-cases for lfence are for blocking out-of-order execution of rdtsc for timing very short blocks of code, or for Spectre mitigation by blocking speculation through a conditional or indirect branch.

See also When should I use _mm_sfence, _mm_lfence and _mm_mfence (my answer and @BeeOnRope's answer) for more about why lfence is not useful, and when to use each of the barrier instructions. (Or in mine, the C++ intrinsics when programming in C++ instead of asm.)
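The rdtsc use-case mentioned above might look roughly like this; a sketch only, since serious measurement also needs warm-up, repetition and attention to frequency scaling:

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc, _mm_lfence

// Sketch: lfence keeps rdtsc from executing out of order with the timed work.
static inline uint64_t rdtsc_fenced() {
    _mm_lfence();            // wait for earlier instructions to complete locally
    uint64_t t = __rdtsc();
    _mm_lfence();            // keep later instructions from starting before rdtsc finishes
    return t;
}

int main() {
    uint64_t t0 = rdtsc_fenced();
    volatile int sink = 0;
    for (int i = 0; i < 1000; ++i) sink += i;   // tiny block of code to time
    uint64_t t1 = rdtsc_fenced();
    std::printf("~%llu reference cycles\n", (unsigned long long)(t1 - t0));
}
```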
As an aside to the other answers, the HotSpot devs found that lock; addl $0,0(%%esp) with a zero offset may not be optimal: on some processors it can introduce false data dependencies (see the related jdk bug). Touching a stack location with a different offset can improve performance under some circumstances.
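A sketch of that variant (the -64 offset is only illustrative; the jdk bug discussion and the Shipilëv microbenchmarks mentioned earlier are where the actual measurements live):

```cpp
// Sketch: the same dummy lock'ed add, but at an offset below the stack
// pointer, so it is less likely to touch data (locals, return address) that
// later loads depend on. The value at that location is unchanged (we add 0).
static inline void full_barrier_offset_stack() {
    __asm__ __volatile__("lock; addl $0,-64(%%rsp)" ::: "memory", "cc");
}
```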
The important part of lock; addl and xchgl is the lock prefix. It's implicit for xchgl. There is really no difference between the two. I'd look at how they assemble and choose the one that's shorter (in bytes), since that's usually faster for equivalent operations on x86 (hence tricks like xorl eax,eax).

The presence of SSE2 is probably just a proxy for the real condition, which is ultimately a function of cpuid. It probably turns out that SSE2 implies the existence of lfence, and the availability of SSE2 was checked/cached at boot. lfence is required when it's available.
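A sketch of the kind of check being described (the helper name is mine, and a runtime check like this is only for illustration): the SSE2 CPUID feature bit is what guarantees that lfence (and mfence) exist:

```cpp
#include <cpuid.h>    // GCC/Clang helper for the CPUID instruction
#include <cstdio>

// Sketch: CPUID leaf 1, EDX bit 26 is the SSE2 feature flag.
// SSE2 implies lfence/mfence exist, so this is the "proxy" check.
static bool cpu_has_sse2() {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (edx & (1u << 26)) != 0;
}

int main() {
    std::printf("SSE2 (and therefore lfence): %s\n",
                cpu_has_sse2() ? "available" : "not available");
}
```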