Can I force cache coherency on a multicore x86 CPU?
The other week, I wrote a little thread class and a one-way message pipe to allow communication between threads (two pipes per thread, obviously, for bidirectional communication). Everything worked fine on my Athlon 64 X2, but I was wondering if I'd run into any problems if both threads were looking at the same variable and the local cached value for this variable on each core was out of sync.
I know the volatile keyword will force a variable to refresh from memory, but is there a way on multicore x86 processors to force the caches of all cores to synchronize? Is this something I need to worry about, or will volatile and proper use of lightweight locking mechanisms (I was using _InterlockedExchange to set my volatile pipe variables) handle all cases where I want to write "lock free" code for multicore x86 CPUs?
I'm already aware of and have used Critical Sections, Mutexes, Events, and so on. I'm mostly wondering if there are x86 intrinsics that I'm not aware of which force or can be used to enforce cache coherency.
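For context, here is a minimal sketch of the kind of flag signalling the question describes, using _InterlockedExchange on a volatile variable. This is my own reconstruction, not the asker's actual code; the names and the omitted message storage are hypothetical.

```cpp
// Hypothetical reconstruction of the pattern described above: one thread
// signals the other through a volatile flag set with _InterlockedExchange.
// The pipe class and its message storage are omitted.
#include <intrin.h>

volatile long g_messageReady = 0;   // the "volatile pipe variable"

void post_message()                 // producer side
{
    // ... write the message payload where the consumer can see it ...
    _InterlockedExchange(&g_messageReady, 1);             // atomically set the flag
}

bool poll_message()                 // consumer side
{
    return _InterlockedExchange(&g_messageReady, 0) != 0; // consume the flag
}
```

Exchanging the flag back to 0 on the consumer side makes the poll a consuming operation rather than a plain load, which is one way a one-way pipe of this sort can hand off a "message ready" notification.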
Comments (9)
volatile only forces your code to re-read the value; it cannot control where the value is read from. If the value was recently read by your code then it will probably be in cache, in which case volatile will force it to be re-read from cache, NOT from memory.

There are not a lot of cache coherency instructions in x86. There are prefetch instructions like prefetchnta, but that doesn't affect the memory-ordering semantics. It used to be implemented by bringing the value to L1 cache without polluting L2, but things are more complicated for modern Intel designs with a large shared inclusive L3 cache.

x86 CPUs use a variation on the MESI protocol (MESIF for Intel, MOESI for AMD) to keep their caches coherent with each other (including the private L1 caches of different cores). A core that wants to write a cache line has to force other cores to invalidate their copy of it before it can change its own copy from Shared to Modified state.

You don't need any fence instructions (like MFENCE) to produce data in one thread and consume it in another on x86, because x86 loads/stores have acquire/release semantics built in. You do need MFENCE (a full barrier) to get sequential consistency. (A previous version of this answer suggested that clflush was needed, which is incorrect.)

You do need to prevent compile-time reordering, because C++'s memory model is weakly ordered. volatile is an old, bad way to do this; C++11 std::atomic is a much better way to write lock-free code.
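To make that last point concrete, here is a minimal sketch (my own example, not part of the original answer; the names are made up) of publishing data from one thread to another with C++11 std::atomic, pairing a release store with an acquire load:

```cpp
// Minimal producer/consumer hand-off with C++11 std::atomic:
// a release store paired with an acquire load, instead of volatile.
#include <atomic>
#include <thread>
#include <iostream>

int payload = 0;                    // ordinary data produced by one thread
std::atomic<bool> ready{false};     // flag that publishes the payload

void producer() {
    payload = 42;                                      // write the data first
    ready.store(true, std::memory_order_release);      // then publish it
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // spin until published
    std::cout << payload << '\n';                      // guaranteed to see 42
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```

On x86 both the release store and the acquire load compile down to plain MOV instructions; the memory-order arguments mainly constrain the compiler, which is exactly the compile-time reordering the answer warns about.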
Cache coherence is guaranteed between cores by the MESI protocol employed by x86 processors. You only need to worry about memory coherence when dealing with external hardware which may access memory while data is still sitting in the cores' caches. That doesn't look like your case here, though, since the text suggests you're programming in userland.
You don't need to worry about cache coherency. The hardware will take care of that. What you may need to worry about are performance issues due to that cache coherency.
If core#1 writes to a variable, that invalidates all other copies of the cache line in other cores (because it has to get exclusive ownership of the cache line before committing the store). When core#2 reads that same variable, it will miss in cache (unless core#1 has already written it back as far as a shared level of cache).
Since an entire cache line (64 bytes) has to be read from memory (or written back to shared cache and then read by core#2), it will have some performance cost. In this case, it's unavoidable. This is the desired behavior.
The problem is that when you have multiple variables in the same cache line, the processor might spend extra time keeping the caches in sync even if the cores are reading/writing different variables within the same cache line.
That cost can be avoided by making sure those variables are not in the same cache line. This effect is known as False Sharing since you are forcing the processors to synchronize the values of objects which are not actually shared between threads.
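A minimal sketch of the usual fix (my own example, with made-up names): align each thread's data to its own cache line so the two counters never share one.

```cpp
// Avoiding false sharing: each thread's counter gets its own 64-byte cache
// line, so writes by one thread don't invalidate the line the other is using.
#include <atomic>
#include <thread>
#include <iostream>

struct alignas(64) PaddedCounter {      // 64 bytes = typical x86 cache line
    std::atomic<long> value{0};
};

PaddedCounter counters[2];              // one cache line per thread

void work(int i) {
    for (long n = 0; n < 1000000; ++n)
        counters[i].value.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::thread t0(work, 0), t1(work, 1);
    t0.join();
    t1.join();
    std::cout << counters[0].value.load() + counters[1].value.load() << '\n';
}
```

On current x86 CPUs a cache line is 64 bytes; where implemented, C++17 also provides std::hardware_destructive_interference_size as a portable name for this constant.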
Volatile won't do it. In C++, volatile only constrains certain compiler optimizations, such as keeping a variable in a register instead of memory or eliminating accesses to it entirely.
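To make that concrete, here is a small sketch (mine, not the answer's; the counter names are invented) showing that volatile provides no atomicity for read-modify-write operations, whereas std::atomic does:

```cpp
// volatile does not make ++ atomic: the load and store are separate,
// so concurrent increments can be lost. std::atomic provides an atomic
// read-modify-write.
#include <atomic>
#include <thread>
#include <iostream>

volatile long racy = 0;        // NOT thread-safe
std::atomic<long> safe{0};     // thread-safe

void work() {
    for (int i = 0; i < 100000; ++i) {
        racy = racy + 1;       // read then write; updates can be lost
        safe.fetch_add(1);     // atomic increment
    }
}

int main() {
    std::thread a(work), b(work);
    a.join();
    b.join();
    // "racy" usually prints less than 200000; "safe" is always 200000.
    std::cout << "volatile: " << racy << "  atomic: " << safe.load() << '\n';
}
```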
You didn't specify which compiler you are using, but if you're on Windows, take a look at this article here. Also take a look at the available synchronization functions here. You might want to note that in general volatile is not enough to do what you want it to do, but under VC 2005 and 2008 there are non-standard semantics added to it that add implied memory barriers around reads and writes.

If you want things to be portable, you're going to have a much harder road ahead of you.
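A hedged sketch of the MSVC-specific behaviour being referred to (my example, not from the article): under VC 2005/2008, and later compilers with /volatile:ms, a volatile write acts as a release and a volatile read as an acquire, so this hand-off works on that compiler even though standard C++ does not guarantee it.

```cpp
// MSVC-specific (/volatile:ms) flag hand-off. This relies on Microsoft's
// extended volatile semantics and is a data race in portable C++;
// std::atomic is the portable equivalent.
#include <thread>
#include <iostream>

int payload = 0;               // ordinary data
volatile bool ready = false;   // MSVC gives this acquire/release ordering

void producer() {
    payload = 123;
    ready = true;              // release store under /volatile:ms
}

void consumer() {
    while (!ready) {}          // acquire load under /volatile:ms
    std::cout << payload << '\n';
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```

The portable way to express the same thing is the std::atomic version shown earlier.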
There's a series of articles explaining modern memory architectures here, including Intel Core 2 caches and many more modern architecture topics.

The articles are very readable and well illustrated. Enjoy!
There are several sub-questions in your question so I'll answer them to the best of my knowledge.
The following is a good article in reference to using volatile with threaded programs: Volatile Almost Useless for Multi-Threaded Programming.
Herb Sutter seemed to simply suggest that any two variables should reside on separate cache lines. He does this in his concurrent queue with padding between his locks and node pointers.
Edit: If you're using the Intel compiler or GCC, you can use the atomic builtins, which seem to do their best to preempt the cache when possible.
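For reference, here is a small sketch (mine, with invented names) using the GCC/ICC __sync atomic builtins that the edit mentions, in this case to build a tiny spinlock; __sync_lock_test_and_set has acquire semantics and __sync_lock_release has release semantics.

```cpp
// Tiny spinlock built from the GCC/ICC __sync atomic builtins.
#include <thread>
#include <iostream>

static int lock_word = 0;    // 0 = unlocked, 1 = locked
static long counter = 0;     // protected by the spinlock

static void lock()   { while (__sync_lock_test_and_set(&lock_word, 1)) { /* spin */ } }
static void unlock() { __sync_lock_release(&lock_word); }

static void work() {
    for (int i = 0; i < 100000; ++i) {
        lock();
        ++counter;
        unlock();
    }
}

int main() {
    std::thread a(work), b(work);
    a.join();
    b.join();
    std::cout << counter << '\n';   // always 200000
}
```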