C++11 引入了标准化内存模型。这是什么意思？它将如何影响 C++编程？

发布于 2024-11-15 06:57:26 字数 820 浏览 4 评论 0原文

C++11 引入了标准化内存模型，但这到底意味着什么？它将如何影响 C++ 编程？

这篇文章（作者：Gavin ClarkeHerb 的strong>萨特）说，

内存模型意味着C++代码现在有一个标准化的库可以调用不管编译器是谁制造的以及它在什么平台上运行。有一个标准方法可以控制如何不同的线程对话处理器的内存。
“当你谈论分裂时 [代码] 跨不同的核心在标准中，我们正在谈论内存模型。我们要去优化它而不破坏假设人们会去在代码中编写，”萨特说。

好吧，我可以记住这个以及网上提供的类似段落（因为我从出生起就有了自己的记忆模型：P）甚至可以发帖作为其他人提出的问题的答案，但说实话，我什至不太理解

以前用于开发多线程应用程序的 C++ 程序员，所以它是 POSIX 线程还是 Windows 又有什么关系呢？线程，或 C++11有什么好处？我想了解底层细节，

我也觉得 C++11 内存模型与 C++11 多线程支持有某种关系，因为我经常看到这两者在一起。如果是的话，它们到底是如何相关的？

我不知道多线程的内部结构是如何工作的，以及内存模型的一般含义是什么。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

╭⌒浅淡时光〆 2024-11-22 06:57:26

首先，你必须学会像语言律师一样思考。

C++ 规范不引用任何特定的编译器、操作系统或 CPU。它引用了一个抽象机器，它是实际系统的概括。在语言律师的世界里，程序员的工作是为抽象机编写代码；编译器的工作是在具体机器上实现该代码。通过严格按照规范进行编码，您可以确定您的代码无需修改即可在任何具有兼容 C++ 编译器的系统上编译和运行，无论是现在还是 50 年后。

C++98/C++03 规范中的抽象机基本上是单线程的。因此，不可能编写符合规范的“完全可移植”的多线程 C++ 代码。该规范甚至没有提及内存加载和存储的原子性或加载和存储可能发生的顺序，更不用说互斥体之类的事情了。

当然，您可以在实践中为特定的具体系统（例如 pthreads 或 Windows）编写多线程代码。但是没有标准方法来为C++98/C++03编写多线程代码。

C++11 中的抽象机在设计上是多线程的。它还具有明确定义的内存模型；也就是说，它说明了编译器在访问内存时可以做什么和不可以做什么。

考虑以下示例，其中两个线程同时访问一对全局变量：

           Global
           int x, y;

Thread 1            Thread 2
x = 17;             cout << y << " ";
y = 37;             cout << x << endl;

线程 2 可能输出什么？

在C++98/C++03下，这甚至不是未定义的行为；这个问题本身毫无意义，因为该标准没有考虑任何称为“线程”的东西。

在 C++11 下，结果是未定义行为，因为加载和存储通常不需要是原子的。这看起来似乎没什么太大的改进……但就其本身而言，事实并非如此。

但使用 C++11，您可以这样写：

           Global
           atomic<int> x, y;

Thread 1                 Thread 2
x.store(17);             cout << y.load() << " ";
y.store(37);             cout << x.load() << endl;

现在事情变得更有趣了。首先，这里的行为是定义的。线程 2 现在可以打印 0 0（如果它在线程 1 之前运行）、37 17（如果它在线程 1 之后运行）或 0 17 > （如果它在线程 1 分配给 x 之后但在分配给 y 之前运行）。

它无法打印 37 0，因为 C++11 中原子加载/存储的默认模式是强制顺序一致性。这仅意味着所有加载和存储必须“好像”按照您在每个线程中编写它们的顺序发生，而线程之间的操作可以按照系统喜欢的方式交错。因此，原子的默认行为为加载和存储提供了原子性和排序。

现在，在现代 CPU 上，确保顺序一致性的成本可能很高。特别是，编译器可能会在每次访问之间发出完整的内存屏障。但是，如果您的算法可以容忍无序加载和存储；即，如果它需要原子性但不需要排序；即，如果它可以容忍 37 0 作为该程序的输出，那么您可以这样写：

           Global
           atomic<int> x, y;

Thread 1                            Thread 2
x.store(17,memory_order_relaxed);   cout << y.load(memory_order_relaxed) << " ";
y.store(37,memory_order_relaxed);   cout << x.load(memory_order_relaxed) << endl;

CPU 越现代，就越有可能比前面的示例更快。

最后，如果您只需要保持特定的加载和存储顺序，您可以编写：

           Global
           atomic<int> x, y;

Thread 1                            Thread 2
x.store(17,memory_order_release);   cout << y.load(memory_order_acquire) << " ";
y.store(37,memory_order_release);   cout << x.load(memory_order_acquire) << endl;

这将我们带回到有序的加载和存储 - 因此 37 0 不再是可能的输出 - 但它确实因此开销最小。（在这个简单的示例中，结果与成熟的顺序一致性相同；在较大的程序中，情况并非如此。）

当然，如果您想看到的唯一输出是 0 0或 37 17，您只需在原始代码周围包裹一个互斥体即可。但如果您已经读到这里，我敢打赌您已经知道它是如何工作的，并且这个答案已经比我预期的要长了:-)。

所以，底线。互斥体很棒，C++11 对它们进行了标准化。但有时出于性能原因，您需要较低级别的原语（例如，经典的双重检查锁定模式）。新标准提供了互斥体和条件变量等高级小工具，还提供了原子类型和各种类型的内存屏障等低级小工具。因此，现在您可以完全使用标准指定的语言编写复杂的高性能并发例程，并且可以确定您的代码将在今天和明天的系统上编译和运行不变。

坦率地说，除非您是专家并且正在处理一些严重的低级代码，否则您可能应该坚持使用互斥体和条件变量。这就是我打算做的。

有关此内容的更多信息，请参阅这篇博文。

First, you have to learn to think like a Language Lawyer.

The C++ specification does not make reference to any particular compiler, operating system, or CPU. It makes reference to an abstract machine that is a generalization of actual systems. In the Language Lawyer world, the job of the programmer is to write code for the abstract machine; the job of the compiler is to actualize that code on a concrete machine. By coding rigidly to the spec, you can be certain that your code will compile and run without modification on any system with a compliant C++ compiler, whether today or 50 years from now.

The abstract machine in the C++98/C++03 specification is fundamentally single-threaded. So it is not possible to write multi-threaded C++ code that is "fully portable" with respect to the spec. The spec does not even say anything about the atomicity of memory loads and stores or the order in which loads and stores might happen, never mind things like mutexes.

Of course, you can write multi-threaded code in practice for particular concrete systems – like pthreads or Windows. But there is no standard way to write multi-threaded code for C++98/C++03.

The abstract machine in C++11 is multi-threaded by design. It also has a well-defined memory model; that is, it says what the compiler may and may not do when it comes to accessing memory.

Consider the following example, where a pair of global variables are accessed concurrently by two threads:

           Global
           int x, y;

Thread 1            Thread 2
x = 17;             cout << y << " ";
y = 37;             cout << x << endl;

What might Thread 2 output?

Under C++98/C++03, this is not even Undefined Behavior; the question itself is meaningless because the standard does not contemplate anything called a "thread".

Under C++11, the result is Undefined Behavior, because loads and stores need not be atomic in general. Which may not seem like much of an improvement... And by itself, it's not.

But with C++11, you can write this:

           Global
           atomic<int> x, y;

Thread 1                 Thread 2
x.store(17);             cout << y.load() << " ";
y.store(37);             cout << x.load() << endl;

Now things get much more interesting. First of all, the behavior here is defined. Thread 2 could now print 0 0 (if it runs before Thread 1), 37 17 (if it runs after Thread 1), or 0 17 (if it runs after Thread 1 assigns to x but before it assigns to y).

What it cannot print is 37 0, because the default mode for atomic loads/stores in C++11 is to enforce sequential consistency. This just means all loads and stores must be "as if" they happened in the order you wrote them within each thread, while operations among threads can be interleaved however the system likes. So the default behavior of atomics provides both atomicity and ordering for loads and stores.

Now, on a modern CPU, ensuring sequential consistency can be expensive. In particular, the compiler is likely to emit full-blown memory barriers between every access here. But if your algorithm can tolerate out-of-order loads and stores; i.e., if it requires atomicity but not ordering; i.e., if it can tolerate 37 0 as output from this program, then you can write this:

           Global
           atomic<int> x, y;

Thread 1                            Thread 2
x.store(17,memory_order_relaxed);   cout << y.load(memory_order_relaxed) << " ";
y.store(37,memory_order_relaxed);   cout << x.load(memory_order_relaxed) << endl;

The more modern the CPU, the more likely this is to be faster than the previous example.

Finally, if you just need to keep particular loads and stores in order, you can write:

           Global
           atomic<int> x, y;

Thread 1                            Thread 2
x.store(17,memory_order_release);   cout << y.load(memory_order_acquire) << " ";
y.store(37,memory_order_release);   cout << x.load(memory_order_acquire) << endl;

This takes us back to the ordered loads and stores – so 37 0 is no longer a possible output – but it does so with minimal overhead. (In this trivial example, the result is the same as full-blown sequential consistency; in a larger program, it would not be.)

Of course, if the only outputs you want to see are 0 0 or 37 17, you can just wrap a mutex around the original code. But if you have read this far, I bet you already know how that works, and this answer is already longer than I intended :-).

So, bottom line. Mutexes are great, and C++11 standardizes them. But sometimes for performance reasons you want lower-level primitives (e.g., the classic double-checked locking pattern). The new standard provides high-level gadgets like mutexes and condition variables, and it also provides low-level gadgets like atomic types and the various flavors of memory barrier. So now you can write sophisticated, high-performance concurrent routines entirely within the language specified by the standard, and you can be certain your code will compile and run unchanged on both today's systems and tomorrow's.

Although to be frank, unless you are an expert and working on some serious low-level code, you should probably stick to mutexes and condition variables. That's what I intend to do.

For more on this stuff, see this blog post.

回复收藏 0 原文

思慕 2024-11-22 06:57:26

我将仅给出我理解内存一致性模型（或简称内存模型）的类比。它的灵感来自 Leslie Lamport 的开创性论文 “分布式环境中的时间、时钟和事件顺序”系统”。
这个类比很恰当，也具有根本意义，但对许多人来说可能有点过分了。然而，我希望它提供一个有助于推理记忆一致性模型的心理图像（图形表示）。

让我们在时空图中查看所有内存位置的历史，其中水平轴代表地址空间（即每个内存位置由该轴上的一个点表示），垂直轴代表时间（我们将看到，一般来说，没有通用的时间概念）。因此，每个内存位置保存的值的历史由该内存地址处的垂直列表示。每个值的更改都是由于其中一个线程向该位置写入新值所致。通过内存映像，我们指的是在特定时间可观察到的所有内存位置的值的聚合/组合> 通过特定线程。

引自“内存一致性和缓存一致性入门”

直观的（也是最严格的）内存模型是顺序一致性（SC），其中多线程执行应该看起来像每个组成线程的顺序执行的交错，就好像线程在单核上进行时间复用一样处理器。

全局内存顺序可能因程序的一次运行而异，并且可能事先未知。 SC 的特征是地址时空图中的一组水平切片，代表同时平面（即内存映像）。在给定平面上，其所有事件（或内存值）都是同时发生的。有一个“绝对时间”的概念，其中所有线程都同意哪些内存值是同时发生的。在SC中，在每一时刻，所有线程都只共享一个内存映像。也就是说，在每个时刻，所有处理器都同意内存映像（即内存的聚合内容）。这不仅意味着所有线程查看所有内存位置的相同值序列，而且还意味着所有处理器观察所有变量的相同值组合。这相当于所有线程以相同的总顺序观察所有内存操作（在所有内存位置上）。

在宽松的内存模型中，每个线程将以自己的方式分割地址空间时间，唯一的限制是每个线程的切片不得相互交叉，因为所有线程必须就每个单独内存位置的历史达成一致（当然，不同线程的切片可能并且将会相互交叉）。没有通用的方法将其分割（没有地址-空间-时间的特权叶状结构）。切片不必是平面的（或线性的）。它们可以是弯曲的，这使得一个线程可以不按写入顺序读取另一个线程写入的值。不同内存位置的历史记录可以相对于彼此任意滑动（或拉伸）当通过任何特定线程查看时。每个线程对于哪些事件（或者等效地，内存值）是同时发生的有不同的感觉。与一个线程同时发生的一组事件（或内存值）与另一个线程不同时发生。因此，在宽松的内存模型中，所有线程仍然观察每个内存位置的相同历史记录（即值序列）。但他们可能会观察到不同的内存图像（即所有内存位置的值的组合）。即使同一线程按顺序写入两个不同的内存位置，其他线程也可能以不同的顺序观察到这两个新写入的值。

【图片来自维基百科】

熟悉爱因斯坦狭义相对论的读者会注意到我所暗示的内容。将 Minkowski 的话翻译成内存模型领域：地址空间和时间是地址空间时间的影子。在这种情况下，每个观察者（即线程）将把事件的影子（即内存存储/加载）投射到他自己的世界线（即他的时间轴）和他自己的同时平面（他的地址空间轴）上。 C++11 内存模型中的线程对应于狭义相对论中彼此相对移动的观察者。顺序一致性对应于伽利略时空（即，所有观察者都同意事件的绝对顺序和全球同时性感）。

记忆模型和狭义相对论之间的相似之处源于以下事实：两者都定义了部分有序的事件集，通常称为因果集。一些事件（即内存存储）可以影响其他事件（但不受其影响）。 C++11 线程（或物理学中的观察者）只不过是事件链（即全序集）（例如，内存加载和存储到可能不同的地址）。

在相对论中，部分有序事件的看似混乱的图景被恢复了某种秩序，因为所有观察者都同意的唯一时间顺序是“类时”事件之间的顺序（即原则上可以通过任何速度较慢的粒子连接的事件）比真空中的光速）。只有与时间相关的事件是不变排序的。
物理学时代，克雷格·卡伦德。

在 C++11 内存模型中，使用类似的机制（获取-释放一致性模型）来建立这些局部因果关系。

为了提供内存一致性的定义和放弃 SC 的动机，我将引用“入门指南”关于内存一致性和缓存一致性”

对于共享内存机器，内存一致性模型定义了其内存系统的架构上可见的行为。单个处理器核心的正确性标准将行为划分为“一个正确结果”和“许多不正确的替代结果”。这是因为处理器的体系结构要求线程的执行将给定的输入状态转换为单个明确定义的输出状态，即使在无序内核上也是如此。然而，共享内存一致性模型涉及多个线程的加载和存储，并且通常允许许多正确的执行，同时不允许许多（更多）不正确的执行。多次正确执行的可能性是由于 ISA 允许多个线程同时执行，通常具有来自不同线程的指令的许多可能的合法交错。
宽松或弱内存一致性模型的动机是强模型中的大多数内存排序都是不必要。如果一个线程更新了十个数据项，然后更新了一个同步标志，程序员通常不关心数据项是否按彼此的顺序更新，而只关心所有数据项在标志更新之前更新（通常使用 FENCE 指令实现））。宽松模型试图捕捉这种增加的排序灵活性，并仅保留程序员“要求”以获得 SC 的更高性能和正确性的顺序。例如，在某些架构中，每个核心使用 FIFO 写入缓冲区来在将结果写入高速缓存之前保存已提交（退休）存储的结果。这种优化提高了性能，但违反了 SC。写入缓冲区隐藏了服务存储未命中的延迟。由于商店很常见，因此能够避免在大多数商店中滞留是一个重要的好处。对于单核处理器，通过确保对地址 A 的加载将最新存储的值返回到 A，即使对 A 的一个或多个存储在写缓冲区中，也可以使写缓冲区在架构上不可见。这通常是通过将最近存储到 A 的值旁路到从 A 的加载来完成的，其中“最新”由程序顺序确定，或者如果到 A 的存储位于写入缓冲区中，则通过停止 A 的加载来完成。当使用多个核心时，每个核心都有自己的旁路写入缓冲区。如果没有写缓冲区，硬件是 SC，但有写缓冲区则不是 SC，这使得写缓冲区在多核处理器中在架构上可见。
如果内核具有非 FIFO 写入缓冲区，允许存储按照与其进入的顺序不同的顺序离开，则可能会发生存储-存储重新排序。如果第一个存储在高速缓存中未命中，而第二个存储命中，或者如果第二个存储可以与较早的存储（即，在第一个存储之前）合并，则可能会发生这种情况。加载-加载重新排序也可能发生在不按程序顺序执行指令的动态调度核心上。这与在另一个核心上重新排序存储的行为相同（您能举出两个线程之间交错的示例吗？）。使用较晚的存储对较早的加载进行重新排序（加载-存储重新排序）可能会导致许多不正确的行为，例如在释放保护值的锁后加载值（如果存储是解锁操作）。请注意，即使核心按照程序顺序执行所有指令，由于通常实现的 FIFO 写入缓冲区中的本地旁路，也可能会出现存储加载重新排序。

由于缓存一致性和内存一致性有时会被混淆，因此引用以下内容也很有启发：

与一致性不同，缓存一致性对软件既不可见，也不是必需的。 Coherence 旨在使共享内存系统的缓存在功能上与单核系统中的缓存一样不可见。正确的一致性确保程序员无法通过分析加载和存储的结果来确定系统是否以及在何处具有缓存。这是因为正确的一致性可确保缓存永远不会启用新的或不同的功能行为（程序员仍然可以使用计时< /strong> 信息）。缓存一致性协议的主要目的是维持每个内存位置的单写入器多读取器（SWMR）不变。
连贯性和一致性之间的一个重要区别是，连贯性是在每个内存位置的基础上指定的，而一致性是针对所有指定的强>内存位置。

继续我们的想象，SWMR 不变量对应于这样的物理要求：任何一个位置最多有一个粒子，但任何位置可以有无限数量的观察者。

I will just give the analogy with which I understand memory consistency models (or memory models, for short). It is inspired by Leslie Lamport's seminal paper "Time, Clocks, and the Ordering of Events in a Distributed System".
The analogy is apt and has fundamental significance, but may be overkill for many people. However, I hope it provides a mental image (a pictorial representation) that facilitates reasoning about memory consistency models.

Let’s view the histories of all memory locations in a space-time diagram in which the horizontal axis represents the address space (i.e., each memory location is represented by a point on that axis) and the vertical axis represents time (we will see that, in general, there is not a universal notion of time). The history of values held by each memory location is, therefore, represented by a vertical column at that memory address. Each value change is due to one of the threads writing a new value to that location. By a memory image, we will mean the aggregate/combination of values of all memory locations observable at a particular time by a particular thread.

Quoting from "A Primer on Memory Consistency and Cache Coherence"

The intuitive (and most restrictive) memory model is sequential consistency (SC) in which a multithreaded execution should look like an interleaving of the sequential executions of each constituent thread, as if the threads were time-multiplexed on a single-core processor.

That global memory order can vary from one run of the program to another and may not be known beforehand. The characteristic feature of SC is the set of horizontal slices in the address-space-time diagram representing planes of simultaneity (i.e., memory images). On a given plane, all of its events (or memory values) are simultaneous. There is a notion of Absolute Time, in which all threads agree on which memory values are simultaneous. In SC, at every time instant, there is only one memory image shared by all threads. That's, at every instant of time, all processors agree on the memory image (i.e., the aggregate content of memory). Not only does this imply that all threads view the same sequence of values for all memory locations, but also that all processors observe the same combinations of values of all variables. This is the same as saying all memory operations (on all memory locations) are observed in the same total order by all threads.

In relaxed memory models, each thread will slice up address-space-time in its own way, the only restriction being that slices of each thread shall not cross each other because all threads must agree on the history of every individual memory location (of course, slices of different threads may, and will, cross each other). There is no universal way to slice it up (no privileged foliation of address-space-time). Slices do not have to be planar (or linear). They can be curved and this is what can make a thread read values written by another thread out of the order they were written in. Histories of different memory locations may slide (or get stretched) arbitrarily relative to each other when viewed by any particular thread. Each thread will have a different sense of which events (or, equivalently, memory values) are simultaneous. The set of events (or memory values) that are simultaneous to one thread are not simultaneous to another. Thus, in a relaxed memory model, all threads still observe the same history (i.e., sequence of values) for each memory location. But they may observe different memory images (i.e., combinations of values of all memory locations). Even if two different memory locations are written by the same thread in sequence, the two newly written values may be observed in different order by other threads.

[Picture from Wikipedia]

Readers familiar with Einstein’s Special Theory of Relativity will notice what I am alluding to. Translating Minkowski’s words into the memory models realm: address space and time are shadows of address-space-time. In this case, each observer (i.e., thread) will project shadows of events (i.e., memory stores/loads) onto his own world-line (i.e., his time axis) and his own plane of simultaneity (his address-space axis). Threads in the C++11 memory model correspond to observers that are moving relative to each other in special relativity. Sequential consistency corresponds to the Galilean space-time (i.e., all observers agree on one absolute order of events and a global sense of simultaneity).

The resemblance between memory models and special relativity stems from the fact that both define a partially-ordered set of events, often called a causal set. Some events (i.e., memory stores) can affect (but not be affected by) other events. A C++11 thread (or observer in physics) is no more than a chain (i.e., a totally ordered set) of events (e.g., memory loads and stores to possibly different addresses).

In relativity, some order is restored to the seemingly chaotic picture of partially ordered events, since the only temporal ordering that all observers agree on is the ordering among “timelike” events (i.e., those events that are in principle connectible by any particle going slower than the speed of light in a vacuum). Only the timelike related events are invariantly ordered.
Time in Physics, Craig Callender.

In C++11 memory model, a similar mechanism (the acquire-release consistency model) is used to establish these local causality relations.

To provide a definition of memory consistency and a motivation for abandoning SC, I will quote from "A Primer on Memory Consistency and Cache Coherence"

For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. The correctness criterion for a single processor core partitions behavior between “one correct result” and “many incorrect alternatives”. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many (more) incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads.
Relaxed or weak memory consistency models are motivated by the fact that most memory orderings in strong models are unnecessary. If a thread updates ten data items and then a synchronization flag, programmers usually do not care if the data items are updated in order with respect to each other but only that all data items are updated before the flag is updated (usually implemented using FENCE instructions). Relaxed models seek to capture this increased ordering flexibility and preserve only the orders that programmers “require” to get both higher performance and correctness of SC. For example, in certain architectures, FIFO write buffers are used by each core to hold the results of committed (retired) stores before writing the results to the caches. This optimization enhances performance but violates SC. The write buffer hides the latency of servicing a store miss. Because stores are common, being able to avoid stalling on most of them is an important benefit. For a single-core processor, a write buffer can be made architecturally invisible by ensuring that a load to address A returns the value of the most recent store to A even if one or more stores to A are in the write buffer. This is typically done by either bypassing the value of the most recent store to A to the load from A, where “most recent” is determined by program order, or by stalling a load of A if a store to A is in the write buffer. When multiple cores are used, each will have its own bypassing write buffer. Without write buffers, the hardware is SC, but with write buffers, it is not, making write buffers architecturally visible in a multicore processor.
Store-store reordering may happen if a core has a non-FIFO write buffer that lets stores depart in a different order than the order in which they entered. This might occur if the first store misses in the cache while the second hits or if the second store can coalesce with an earlier store (i.e., before the first store). Load-load reordering may also happen on dynamically-scheduled cores that execute instructions out of program order. That can behave the same as reordering stores on another core (Can you come up with an example interleaving between two threads?). Reordering an earlier load with a later store (a load-store reordering) can cause many incorrect behaviors, such as loading a value after releasing the lock that protects it (if the store is the unlock operation). Note that store-load reorderings may also arise due to local bypassing in the commonly implemented FIFO write buffer, even with a core that executes all instructions in program order.

Because cache coherence and memory consistency are sometimes confused, it is instructive to also have this quote:

Unlike consistency, cache coherence is neither visible to software nor required. Coherence seeks to make the caches of a shared-memory system as functionally invisible as the caches in a single-core system. Correct coherence ensures that a programmer cannot determine whether and where a system has caches by analyzing the results of loads and stores. This is because correct coherence ensures that the caches never enable new or different functional behavior (programmers may still be able to infer likely cache structure using timing information). The main purpose of cache coherence protocols is maintaining the single-writer-multiple-readers (SWMR) invariant for every memory location.
An important distinction between coherence and consistency is that coherence is specified on a per-memory location basis, whereas consistency is specified with respect to all memory locations.

Continuing with our mental picture, the SWMR invariant corresponds to the physical requirement that there be at most one particle located at any one location but there can be an unlimited number of observers of any location.

回复收藏 0 原文

帅的被狗咬 2024-11-22 06:57:26

_{这是一个已有多年历史的问题，但非常受欢迎，值得一提的是，这是学习 C++11 内存模型的绝佳资源。我认为没有必要总结他的演讲来做出另一个完整的答案，但考虑到这是真正编写标准的人，我认为非常值得观看演讲。}

赫伯·萨特 (Herb Sutter) 有三个小时的时间关于 C++11 内存模型的长篇大论，标题为“atomic<>Weapons”，可在 ~~Channel9 网站~~ YouTube 上找到 - 第 1 部分和第 2 部分。该演讲非常技术性，涵盖以下主题：

优化、竞争和内存模型
排序 - 内容：获取和释放
排序 - 如何：互斥体、原子和/或栅栏
的其他限制
对编译器和硬件代码生成性能：x86/x64、IA64、POWER、ARM
Relaxed Atomics

该演讲没有详细说明 API，而是详细介绍了推理、背景、底层和幕后（您是否知道宽松语义仅添加到标准中）因为POWER和ARM不支持高效的同步加载？）。

回复收藏 0 原文

用心笑 2024-11-22 06:57:26

这意味着该标准现在定义了多线程，并且定义了在多线程上下文中发生的情况。当然，人们使用了不同的实现，但这就像问为什么我们应该使用 std::string ，而我们都可以使用自制的 string 类。

当您谈论 POSIX 线程或 Windows 线程时，这有点错觉，因为实际上您谈论的是 x86 线程，因为它是并发运行的硬件功能。无论您使用的是 x86、ARM 还是 MIPS，或者任何你能想到的东西。

回复收藏 0 原文

强者自强 2024-11-22 06:57:26

对于未指定内存模型的语言，您正在为处理器体系结构指定的语言和内存模型编写代码。处理器可能会选择重新排序内存访问以提高性能。因此，如果您的程序存在数据竞争（数据竞争是指多个核心/超线程可以同时访问同一内存），那么您的程序就不是跨平台的，因为它依赖于处理器内存模型。您可以参考 Intel 或 AMD 软件手册来了解处理器如何重新排序内存访问。

非常重要的是，锁（以及带有锁定的并发语义）通常以跨平台方式实现......因此，如果您在没有数据竞争的多线程程序中使用标准锁，那么您不必担心跨平台平台内存模型。

有趣的是，Microsoft C++ 编译器具有 volatile 的获取/释放语义，这是一个 C++ 扩展，用于解决 C++ 中缺乏内存模型的问题 http://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.80).aspx。然而，考虑到 Windows 仅在 x86 / x64 上运行，这并不算什么（Intel 和 AMD 内存模型可以轻松高效地以语言实现获取/释放语义）。

回复收藏 0 原文

我一直都在从未离去 2024-11-22 06:57:26

如果您使用互斥体来保护所有数据，那么您真的不需要担心。互斥体始终提供足够的排序和可见性保证。

现在，如果您使用原子或无锁算法，则需要考虑内存模型。内存模型准确地描述了原子何时提供排序和可见性保证，并为手动编码保证提供了便携式围栏。

以前，原子是使用编译器内部函数或一些更高级别的库来完成的。栅栏可以使用特定于 CPU 的指令（内存屏障）来完成。

回复收藏 0 原文

清欢 2024-11-22 06:57:26

其他一些答案涉及 C++ 内存模型的最基本方面。在实践中，std::atomic<> 的大多数使用都“正常工作”，至少在程序员过度优化之前（例如，通过尝试放松太多事情）。

有一个地方仍然很常见错误：顺序锁。 https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf。顺序锁很有吸引力，因为读者可以避免写入锁字。以下代码基于上述技术报告的图 1，它强调了在 C++ 中实现序列锁时面临的挑战：

atomic<uint64_t> seq; // seqlock representation
int data1, data2;     // this data will be protected by seq

T reader() {
    int r1, r2;
    unsigned seq0, seq1;
    while (true) {
        seq0 = seq;
        r1 = data1; // INCORRECT! Data Race!
        r2 = data2; // INCORRECT!
        seq1 = seq;

        // if the lock didn't change while I was reading, and
        // the lock wasn't held while I was reading, then my
        // reads should be valid
        if (seq0 == seq1 && !(seq0 & 1))
            break;
    }
    use(r1, r2);
}

void writer(int new_data1, int new_data2) {
    unsigned seq0 = seq;
    while (true) {
        if ((!(seq0 & 1)) && seq.compare_exchange_weak(seq0, seq0 + 1))
            break; // atomically moving the lock from even to odd is an acquire
    }
    data1 = new_data1;
    data2 = new_data2;
    seq = seq0 + 2; // release the lock by increasing its value to even
}

data1 和 data2 乍一看似乎不直观> 必须是原子。如果它们不是原子的，那么它们可以在写入（在 writer() 中）的同时被读取（在 reader() 中）。根据 C++ 内存模型，即使 reader() 从未真正使用过数据，这也是一场竞赛。此外，如果它们不是原子的，则编译器可以将每个值的第一次读取缓存在寄存器中。显然您不希望这样...您希望在 reader() 中的 while 循环的每次迭代中重新读取。

使它们原子并使用memory_order_relaxed访问它们也是不够的。原因是 seq 的读取（在 reader() 中）仅具有获取语义。简单来说，如果 X 和 Y 是内存访问，X 在 Y 之前，X 不是获取或释放，并且 Y 是获取，则编译器可以在 X 之前重新排序 Y。如果 Y 是 seq 的第二次读取，并且 X是读取数据，这样的重新排序会破坏锁的实现。

论文给出了一些解决方案。目前性能最佳的可能是在第二次读取 seqlock 之前使用 atomic_thread_fence 和 memory_order_relaxed 的那个。在论文中，它是图 6。我不会在这里复制代码，因为任何读过这篇文章的人都应该阅读这篇论文。它比这篇文章更精确和完整。

最后一个问题是，使 data 变量原子化可能不自然。如果你不能在你的代码中，那么你需要非常小心，因为从非原子到原子的转换仅对原始类型是合法的。 C++20应该添加atomic_ref<>，这将使这个问题更容易解决。

总结一下：即使您认为自己了解 C++ 内存模型，在滚动自己的序列锁之前也应该非常小心。

Some of the other answers get at the most fundamental aspects of the C++ memory model. In practice, most uses of std::atomic<> "just work", at least until the programmer over-optimizes (e.g., by trying to relax too many things).

There is one place where mistakes are still common: sequence locks. There is an excellent and easy-to-read discussion of the challenges at https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf. Sequence locks are appealing because the reader avoids writing to the lock word. The following code is based on Figure 1 of the above technical report, and it highlights the challenges when implementing sequence locks in C++:

atomic<uint64_t> seq; // seqlock representation
int data1, data2;     // this data will be protected by seq

T reader() {
    int r1, r2;
    unsigned seq0, seq1;
    while (true) {
        seq0 = seq;
        r1 = data1; // INCORRECT! Data Race!
        r2 = data2; // INCORRECT!
        seq1 = seq;

        // if the lock didn't change while I was reading, and
        // the lock wasn't held while I was reading, then my
        // reads should be valid
        if (seq0 == seq1 && !(seq0 & 1))
            break;
    }
    use(r1, r2);
}

void writer(int new_data1, int new_data2) {
    unsigned seq0 = seq;
    while (true) {
        if ((!(seq0 & 1)) && seq.compare_exchange_weak(seq0, seq0 + 1))
            break; // atomically moving the lock from even to odd is an acquire
    }
    data1 = new_data1;
    data2 = new_data2;
    seq = seq0 + 2; // release the lock by increasing its value to even
}

As unintuitive as it seems at first, data1 and data2 need to be atomic<>. If they are not atomic, then they could be read (in reader()) at the exact same time as they are written (in writer()). According to the C++ memory model, this is a race even if reader() never actually uses the data. In addition, if they are not atomic, then the compiler can cache the first read of each value in a register. Obviously you wouldn't want that... you want to re-read in each iteration of the while loop in reader().

It is also not sufficient to make them atomic<> and access them with memory_order_relaxed. The reason for this is that the reads of seq (in reader()) only have acquire semantics. In simple terms, if X and Y are memory accesses, X precedes Y, X is not an acquire or release, and Y is an acquire, then the compiler can reorder Y before X. If Y was the second read of seq, and X was a read of data, such a reordering would break the lock implementation.

The paper gives a few solutions. The one with the best performance today is probably the one that uses an atomic_thread_fence with memory_order_relaxed before the second read of the seqlock. In the paper, it's Figure 6. I'm not reproducing the code here, because anyone who has read this far really ought to read the paper. It is more precise and complete than this post.

The last issue is that it might be unnatural to make the data variables atomic. If you can't in your code, then you need to be very careful, because casting from non-atomic to atomic is only legal for primitive types. C++20 is supposed to add atomic_ref<>, which will make this problem easier to resolve.

To summarize: even if you think you understand the C++ memory model, you should be very careful before rolling your own sequence locks.

回复收藏 0 原文