When should I use _mm_sfence, _mm_lfence, and _mm_mfence?


I read the "Intel Optimization guide Guide For Intel Architecture".

However, I still have no idea about when should I use

_mm_sfence()
_mm_lfence()
_mm_mfence()

Could anyone explain when these should be used when writing multi-threaded code?


4 Answers

少女净妖师 2024-10-15 22:58:28

If you're using NT stores, you might want _mm_sfence or maybe even _mm_mfence. The use-cases for _mm_lfence are much more obscure.

If not, just use C++11 std::atomic and let the compiler worry about the asm details of controlling memory ordering.


x86 has a strongly-ordered memory model, but C++ has a very weak memory model (same for C). For acquire/release semantics, you only need to prevent compile-time reordering. See Jeff Preshing's Memory Ordering At Compile Time article.

_mm_lfence and _mm_sfence do have the necessary compiler-barrier effect, but they will also cause the compiler to emit a useless lfence or sfence asm instruction that makes your code run slower.

There are better options for controlling compile-time reordering when you aren't doing any of the obscure stuff that would make you want sfence.

For example, GNU C/C++ asm("" ::: "memory") is a compiler barrier (all values have to be in memory matching the abstract machine because of the "memory" clobber), but no asm instructions are emitted.
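A minimal sketch of that (the variable names here are invented for illustration, and this alone does not make cross-thread publication safe per the C++ abstract machine; it only blocks compile-time reordering):

int data, flag;   // plain non-atomic globals, for illustration only

void set_and_publish(int v) {
    data = v;
    asm("" ::: "memory");  // compiler barrier: emits no instruction, but stops
                           // the compiler from reordering memory access across it
    flag = 1;              // x86 hardware already keeps these two stores in order
}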

If you're using C++11 std::atomic, you can simply do shared_var.store(tmp, std::memory_order_release). That's guaranteed to become globally visible after any earlier C assignments, even to non-atomic variables.
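For example (a sketch; payload and shared_var are stand-ins for your own data):

#include <atomic>

int payload;                   // ordinary non-atomic data
std::atomic<int> shared_var;

void publish(int tmp) {
    payload = tmp;                                      // earlier plain assignment
    shared_var.store(tmp, std::memory_order_release);   // on x86: just a mov store,
                                                        // plus a compile-time barrier
}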

_mm_mfence is potentially useful if you're rolling your own version of C11 / C++11 std::atomic, because an actual mfence instruction is one way to get sequential consistency, i.e. to stop later loads from reading a value until after preceding stores become globally visible. See Jeff Preshing's Memory Reordering Caught in the Act.
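A sketch of what that might look like if you really were rolling your own (don't; std::atomic already does this for you, and volatile here is only a stand-in for a hand-rolled atomic):

#include <immintrin.h>

volatile int my_seqcst_var;

void my_seqcst_store(int v) {
    my_seqcst_var = v;   // the store itself
    _mm_mfence();        // full barrier: later loads in this thread can't read
                         // until the store above is globally visible
}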

But note that mfence seems to be slower on current hardware than using a locked atomic-RMW operation. e.g. xchg [mem], eax is also a full barrier, but runs faster, and does a store. On Skylake, the way mfence is implemented prevents out-of-order execution of even non-memory instructions following it. See the bottom of this answer.

In C++ without inline asm, though, your options for memory barriers are more limited (How many memory barrier instructions does an x86 CPU have?). mfence isn't terrible, and it is what gcc and clang currently use to do sequential-consistency stores.

Seriously, just use C++11 std::atomic or C11 stdatomic if possible, though; it's easier to use and you get quite good code-gen for a lot of things. Or in the Linux kernel, there are already wrapper functions around inline asm for the necessary barriers. Sometimes that's just a compiler barrier, sometimes it's also an asm instruction to get stronger run-time ordering than the default (e.g. for a full barrier).


No barriers will make your stores appear to other threads any faster. All they can do is delay later operations in the current thread until earlier things happen. The CPU already tries to commit pending non-speculative stores to L1d cache as quickly as possible.


_mm_sfence is by far the most likely barrier to actually use manually in C++

The main use-case for _mm_sfence() is after some _mm_stream stores, before setting a flag that other threads will check.

See Enhanced REP MOVSB for memcpy for more about NT stores vs. regular stores, and x86 memory bandwidth. For writing very large buffers (larger than L3 cache size) that definitely won't be re-read any time soon, it can be a good idea to use NT stores.

NT stores are weakly-ordered, unlike normal stores, so you need sfence if you care about publishing the data to another thread. If not (you'll eventually read them from this thread), then you don't. Or if you make a system call before telling another thread the data is ready, that's also serializing.

sfence (or some other barrier) is necessary to give you release/acquire synchronization when using NT stores. C++11 std::atomic implementations leave it up to you to fence your NT stores, so that atomic release-stores can be efficient.

#include <atomic>
#include <immintrin.h>

struct bigbuf {
    alignas(16) int buf[100000];   // _mm_stream_si128 needs 16-byte alignment
    std::atomic<unsigned> buf_ready;
};

void producer(bigbuf *p) {
  __m128i *buf = (__m128i*) (p->buf);

  for (int i = 0; i < 100000 / 4; i++) {   // 4 ints per __m128i
     __m128i vec = _mm_set1_epi32(i);      // placeholder data for this sketch
     _mm_stream_si128(buf + i, vec);       // weakly-ordered NT store
  }

  _mm_sfence();    // All weakly-ordered memory shenanigans stay above this line
  // So we can safely use normal std::atomic release/acquire sync for buf
  p->buf_ready.store(1, std::memory_order_release);
}

Then a consumer can safely do if(p->buf_ready.load(std::memory_order_acquire)) { foo = p->buf[0]; ... } without any data-race Undefined Behaviour. The reader side does not need _mm_lfence; the weakly-ordered nature of NT stores is confined entirely to the core doing the writing. Once it becomes globally visible, it's fully coherent and ordered according to the normal rules.
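Spelled out as a function to pair with producer() above (a sketch; the -1 sentinel is invented):

int consumer(bigbuf *p) {
    if (p->buf_ready.load(std::memory_order_acquire)) {   // pairs with the release store
        int foo = p->buf[0];   // safe: no data race, and no _mm_lfence needed
        return foo;
    }
    return -1;   // data not ready yet
}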

Other use-cases include ordering clflushopt to control the order of data being stored to memory-mapped non-volatile storage. (NVDIMMs using Optane memory, or DIMMs with battery-backed DRAM, exist now.)


_mm_lfence is almost never useful as an actual load fence. Loads can only be weakly ordered when loading from WC (Write-Combining) memory regions, like video RAM. Even movntdqa (_mm_stream_load_si128) is still strongly ordered on normal (WB = write-back) memory, and doesn't do anything to reduce cache pollution. (prefetchnta might, but it's hard to tune and can make things worse.)

TL:DR: if you aren't writing graphics drivers or something else that maps video RAM directly, you don't need _mm_lfence to order your loads.

lfence does have the interesting microarchitectural effect of preventing execution of later instructions until it retires. e.g. to stop _rdtsc() from reading the cycle-counter while earlier work is still pending in a microbenchmark. (Applies always on Intel CPUs, but on AMD only with an MSR setting: Is LFENCE serializing on AMD processors?. Otherwise lfence runs 4 per clock on Bulldozer family, so clearly not serializing.)
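A sketch of that microbenchmark use (assuming GCC/Clang's x86intrin.h for __rdtsc, and an Intel CPU per the caveat above):

#include <x86intrin.h>   // __rdtsc, _mm_lfence

volatile unsigned sink;  // keeps the timed loop from being optimized away

unsigned long long time_work(void) {
    _mm_lfence();                         // wait for earlier instructions to finish
    unsigned long long t0 = __rdtsc();
    _mm_lfence();                         // keep the timed work from starting early
    for (unsigned i = 0; i < 1000; i++)   // placeholder work being measured
        sink = sink + i;
    _mm_lfence();                         // wait for the work to finish
    unsigned long long t1 = __rdtsc();
    return t1 - t0;
}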

Since you're using intrinsics from C/C++, the compiler is generating code for you. You don't have direct control over the asm, but you might possibly use _mm_lfence for things like Spectre mitigation if you can get the compiler to put it in the right place in the asm output: right after a conditional branch, before a dependent double-array access (like foo[bar[i]]). If you're using kernel patches for Spectre, I think the kernel will defend your process from other processes, so you'd only have to worry about this in a program that uses a JIT sandbox and is worried about being attacked from within its own sandbox.
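A sketch of that placement (foo, bar, and len are hypothetical, and you'd have to inspect the compiler's asm output to confirm the fence lands where you want it):

#include <immintrin.h>

int checked_load(const int *foo, const unsigned char *bar, unsigned i, unsigned len) {
    if (i < len) {
        _mm_lfence();        // intended to land right after the branch, blocking
                             // speculative execution of the dependent loads below
        return foo[bar[i]];  // the double array access
    }
    return 0;
}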

王权女流氓 2024-10-15 22:58:28

Here is my understanding, hopefully accurate and simple enough to make sense:

(Itanium) IA64 architecture allows memory reads and writes to be executed in any order, so the order of memory changes from the point of view of another processor is not predictable unless you use fences to enforce that writes complete in a reasonable order.

From here on I am talking about x86; x86 is strongly ordered.

On x86, Intel does not guarantee that a store done on another processor will always be immediately visible on this processor. It is possible that this processor speculatively executed the load (read) just early enough to miss the other processor's store (write). It only guarantees the order that writes become visible to other processors is in program order. It does not guarantee that other processors will immediately see any update, no matter what you do.

Locked read/modify/write instructions are fully sequentially consistent. Because of this, in general you already handle missing the other processor's memory operations because a locked xchg or cmpxchg will sync it all up, you will acquire the relevant cache line for ownership immediately and will update it atomically. If another CPU is racing with your locked operation, either you will win the race and the other CPU will miss the cache and get it back after your locked operation, or they will win the race, and you will miss the cache and get the updated value from them.

lfence stalls instruction issue until all instructions before the lfence have completed. mfence specifically waits for all preceding memory reads to be brought fully into the destination register, and waits for all preceding writes to become globally visible, but does not stall all further instructions as lfence would. sfence does the same for stores only, flushes the write combiners, and ensures that all stores preceding the sfence are globally visible before allowing any stores following the sfence to begin execution.

Fences of any kind are rarely needed on x86, they are not necessary unless you are using write-combining memory or non-temporal instructions, something you rarely do if you are not a kernel mode (driver) developer. Normally, x86 guarantees that all stores are visible in program order, but it does not make that guarantee for WC (write combining) memory or for "non-temporal" instructions that do explicit weakly ordered stores, such as movnti.

So, to summarize, stores are always visible in program order unless you have used special weakly ordered stores or are accessing WC memory type. Algorithms using locked instructions like xchg, or xadd, or cmpxchg, etc, will work without fences because locked instructions are sequentially consistent.
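A sketch of that last point in C++, where the compiler emits the locked instructions for you (the names here are invented):

#include <atomic>

std::atomic<int> counter{0};
std::atomic<int> spinlock{0};

void locked_ops_example() {
    counter.fetch_add(1);   // compiles to a lock-prefixed add or xadd: a full barrier

    while (spinlock.exchange(1) != 0) {   // xchg is implicitly locked: also a full
        // spin until we acquire the lock // barrier, so no separate fence is needed
    }
    // ... critical section ...
    spinlock.store(0, std::memory_order_release);   // unlock
}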

错々过的事 2024-10-15 22:58:28

The intrinsic calls you mention all simply insert an sfence, lfence or mfence instruction when they are called. So the question then becomes "What are the purposes of those fence instructions"?

The short answer is that lfence is completely useless and sfence is almost completely useless for memory-ordering purposes for user-mode programs in x86. On the other hand, mfence serves as a full memory barrier, so you might use it in places where you need a barrier if there isn't already some nearby lock-prefixed instruction providing what you need.

The longer-but-still short answer is...

lfence

lfence is documented to order loads prior to the lfence with respect to loads after, but this guarantee is already provided for normal loads without any fence at all: that is, Intel already guarantees that "loads aren't reordered with other loads". As a practical matter, this leaves the purpose of lfence in user-mode code as an out-of-order execution barrier, useful perhaps for carefully timing certain operations.

sfence

sfence is documented to order stores before and after it, in the same way that lfence does for loads, but just as with loads, store ordering is already guaranteed by Intel in most cases. The primary interesting exception is the so-called non-temporal stores such as movntdq, movnti, maskmovq and a few other instructions. These instructions don't play by the normal memory ordering rules, so you can put an sfence between these stores and any other stores where you want to enforce the relative order. mfence works for this purpose too, but sfence is faster.
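A minimal sketch of that placement (the function and its arguments are invented):

#include <immintrin.h>

void nt_then_flag(int *data, volatile int *flag) {
    _mm_stream_si32(data, 42);   // movnti: a weakly-ordered non-temporal store
    _mm_sfence();                // make the NT store globally visible before...
    *flag = 1;                   // ...this ordinary, strongly-ordered store
}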

mfence

Unlike the other two, mfence actually does something: it serves as a full memory barrier, ensuring that all of the previous loads and stores will have completed¹ before any of the subsequent loads or stores begin execution. This answer is too short to explain the concept of a memory barrier fully, but an example would be Dekker's algorithm, where each thread wanting to enter a critical section stores to a location and then checks to see if the other thread has stored something to its location. For example, on thread 1:

mov   DWORD [thread_1_wants_to_enter], 1  # store our flag
mov   eax,  [thread_2_wants_to_enter]     # check the other thread's flag
test  eax, eax
jnz   retry
; critical section

Here, on x86, you need a memory barrier in between the store (the first mov), and the load (the second mov), otherwise each thread could see zero when they read the other's flag because the x86 memory model allows loads to be re-ordered with earlier stores. So you could insert an mfence barrier as follows to restore sequential consistency and the correct behavior of the algorithm:

mov   DWORD [thread_1_wants_to_enter], 1  # store our flag
mfence
mov   eax,  [thread_2_wants_to_enter]     # check the other thread's flag
test  eax, eax
jnz   retry
; critical section

In practice, you don't see mfence as much as you might expect, because x86 lock-prefixed instructions have the same full-barrier effect, and these are often/always (?) cheaper than an mfence.
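In C++, a seq_cst store gets you the same effect, and the compiler picks the cheap option; a sketch of the thread-1 side (flag names taken from the example above):

#include <atomic>

std::atomic<int> thread_1_wants_to_enter{0};
std::atomic<int> thread_2_wants_to_enter{0};

bool thread1_try_enter() {
    // seq_cst store: compilers emit xchg [mem], reg (or mov + mfence), giving
    // the full barrier Dekker's algorithm needs between the store and the load
    thread_1_wants_to_enter.store(1, std::memory_order_seq_cst);
    return thread_2_wants_to_enter.load(std::memory_order_seq_cst) == 0;
}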


¹ E.g., loads will have been satisfied and stores will have become globally visible (although it would be implemented differently, as long as the visible effect with respect to ordering is "as if" that occurred).

黒涩兲箜 2024-10-15 22:58:28

Caveat: I'm no expert in this. I'm still trying to learn this myself. But since no one has replied in the past two days, it seems experts on memory fence instructions are not plentiful. So here's my understanding ...

Intel is a weakly-ordered memory system. That means your program may execute

array[idx+1] = something
idx++

but the change to idx may be globally visible (e.g. to threads/processes running on other processors) before the change to array. Placing sfence between the two statements will ensure the order in which the writes are sent to the FSB.

Meanwhile, another processor running

newestthing = array[idx]

may have the memory for array cached with a stale copy, but get the updated idx due to a cache miss.
The solution is to use lfence just beforehand to ensure the loads are synchronized.

This article or this article may give better info
