Can someone give a simple explanation of how "full fences" are implemented in .NET with Threading.MemoryBarrier?


I'm clear on the usage of MemoryBarrier, but not on what happens behind the scenes in the runtime. Can anyone give a good explanation of what goes on?


2 Answers

苍景流年 2024-09-02 11:03:45


In a really strong memory model, emitting fence instructions would be unnecessary. All memory accesses would execute in order and all stores would be globally visible.

Memory fences are needed because current common architectures do not provide a strong memory model - x86/x64 can for example reorder reads relative to writes. (A more thorough source is "Intel® 64 and IA-32 Architectures Software Developer’s Manual, 8.2.2 Memory Ordering in P6 and More Recent Processor Families"). As an example from the gazillions, Dekker's algorithm will fail on x86/x64 without fences.
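
To make that store-load reordering concrete, here is a minimal C# sketch in the spirit of Dekker's algorithm (the class and field names are illustrative, not taken from any standard implementation). Each thread stores to its own flag and then loads the other thread's flag; without a full fence between the store and the load, x86/x64 may let the load complete before the store becomes globally visible, so both threads can read 0.

    using System;
    using System.Threading;

    class StoreLoadReordering
    {
        static int flag1, flag2;   // each thread announces itself in its own flag
        static int r1, r2;         // what each thread observed of the other flag

        static void Main()
        {
            var t1 = new Thread(() =>
            {
                flag1 = 1;
                // A full fence (Thread.MemoryBarrier()) would be needed here to
                // keep the following load from moving ahead of the store above.
                r1 = flag2;
            });
            var t2 = new Thread(() =>
            {
                flag2 = 1;
                r2 = flag1;
            });

            t1.Start(); t2.Start();
            t1.Join();  t2.Join();

            // Without fences, r1 == 0 && r2 == 0 is a legal outcome on x86/x64,
            // even though it is impossible under sequential consistency.
            Console.WriteLine("r1={0}, r2={1}", r1, r2);
        }
    }

(A single run will rarely show the reordering; in practice you would run the two threads in a loop many times to observe it.)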

Even if the JIT produces machine code in which instructions with memory loads and stores are carefully placed, its efforts are useless if the CPU then reorders these loads and stores - which it can, as long as the illusion of sequential consistency is maintained for the current context/thread.

Risking oversimplification: it may help to visualize the loads and stores resulting from the instruction stream as a thundering herd of wild animals.
As they cross a narrow bridge (your CPU), you can never be sure about the order of the animals, since some of them will be slower, some faster, some overtake, some fall behind.
If at the start - when you emit the machine code - you partition them into groups by putting infinitely long fences between them, you can at least be sure that group A comes before group B.

Fences ensure the ordering of reads and writes. Wording is not exact, but:

  • a store fence "waits" for all outstanding store (write) operations to finish, but does not affect loads.
  • a load fence "waits" for all outstanding load (read) operations to finish, but does not affect stores.
  • a full fence "waits" for all store and load operations to finish. It has the effect that reads and writes before the fence will get executed before the writes and loads that are on the "other side of the fence" (come later than the fence).
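
As a sketch of that "before the fence / after the fence" guarantee, consider the common publish-with-a-flag pattern below (the names are illustrative). The writer's full fence keeps the store to _data ordered before the store to _ready at both the JIT and the CPU level, and the reader's full fence keeps the load of _ready ordered before the load of _data, so a reader that sees _ready == true is guaranteed to see the published value.

    using System.Threading;

    class Publication
    {
        static int _data;
        static bool _ready;

        static void Publish()
        {
            _data = 42;
            Thread.MemoryBarrier();   // full fence: the store to _data cannot drift
                                      // past the store to _ready
            _ready = true;
        }

        static int? TryConsume()
        {
            if (!_ready)
                return null;
            Thread.MemoryBarrier();   // full fence: the load of _data cannot be
                                      // hoisted above the load of _ready
            return _data;
        }
    }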

What the JIT emits for a full fence depends on the (CPU) architecture and what memory ordering guarantees it provides.
Since the JIT knows exactly what architecture it runs on, it can issue the proper instruction(s).

On my x64 machine, with .NET 4.0 RC, it happens to be a lock or instruction:

            int a = 0;
00000000  sub         rsp,28h 
            Thread.MemoryBarrier();
00000004  lock or     dword ptr [rsp],0 
            Console.WriteLine(a);
00000009  mov         ecx,1 
0000000e  call        FFFFFFFFEFB45AB0 
00000013  nop 
00000014  add         rsp,28h 
00000018  ret 

Intel® 64 and IA-32 Architectures Software Developer’s Manual Chapter 8.1.2:

  • "...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)."
    ..."Locked operations are atomic with respect to all other memory operations and all
    externally visible events. Only instruction fetch and page table accesses can pass
    locked instructions. Locked instructions can be used to synchronize data written by
    one processor and read by another processor."

  • Memory-ordering instructions address this specific need. MFENCE could have been used as a full barrier in the above case (at least in theory - for one, locked operations might be faster; for two, it might result in different behavior). MFENCE and its friends can be found in Chapter 8.2.5 "Strengthening or Weakening the Memory-Ordering Model".

There are some more ways to serialize stores and loads, though they are either impractical or slower than the above methods:

  • In chapter 8.3 you can find full serializing instructions like CPUID. These serialize instruction flow as well: "Nothing can pass a serializing instruction and
    a serializing instruction cannot pass any other instruction (read, write, instruction
    fetch, or I/O)".

  • If you set up memory as strong uncacheable (UC), it will give you a strong memory model: no speculative or out-of-order accesses will be allowed and all accesses will appear on the bus, therefore no need to emit an instruction. :) Of course, this will be a tad slower than usual.

...

So it depends. If there were a computer with strong ordering guarantees, the JIT would probably emit nothing.

IA64 and other architectures have their own memory models - and thus guarantees of memory ordering (or lack of them) - and their own instructions/ways to deal with memory store/load ordering.

葬シ愛 2024-09-02 11:03:45


When doing lock-free concurrent programming, one has to care about reordering of program instructions.

Reordering of program instructions can occur at several stages (see the sketch after this list):

  1. C#/VB.NET/F# compiler optimizations
  2. JIT compiler optimizations
  3. CPU optimizations.
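
As a hedged illustration of stage 2, assuming an optimized (release) build with no debugger attached: the JIT is allowed to read a plain field once and cache it in a register, effectively hoisting the read out of the loop. The worker below may then spin forever even after the main thread sets the flag; declaring the field volatile, or placing a fence inside the loop, forces a fresh read on each iteration.

    using System;
    using System.Threading;

    class HoistingDemo
    {
        static bool _stop;             // plain field: its read may be hoisted out of the loop
        // static volatile bool _stop; // volatile (or a fence in the loop body) prevents that

        static void Main()
        {
            var worker = new Thread(() =>
            {
                // In an optimized build the JIT may turn this into an infinite loop
                // by reading _stop only once.
                while (!_stop) { }
                Console.WriteLine("worker observed _stop");
            });
            worker.Start();

            Thread.Sleep(1000);
            _stop = true;              // may never be observed by the hoisted loop
            worker.Join();
        }
    }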

Memory fences are the only way to ensure a particular order of your program instructions.
Basically, a memory fence is a class of instructions that causes the CPU to enforce an ordering constraint. Memory fences can be put into three categories:

  1. Load fences - ensure no load CPU instructions move across the fences
  2. Store fences - ensure no store CPU instructions move across the fences
  3. Full fences - ensure no load or store CPU instructions move across the fences

In the .NET Framework there are plenty of ways to emit fences: Interlocked, Monitor, ReaderWriterLockSlim, etc.

Thread.MemoryBarrier emits a full fence on both JIT compiler and processor level.
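
A hedged sketch of how a few of those constructs relate to fences (what exactly gets emitted is an implementation detail and varies by runtime version and architecture):

    using System.Threading;

    class FenceSources
    {
        static int _counter;
        static readonly object _gate = new object();

        static void Examples()
        {
            // Explicit full fence, at both the JIT and the processor level.
            Thread.MemoryBarrier();

            // Interlocked operations are implemented with locked instructions on
            // x86/x64 and therefore also act as full fences.
            Interlocked.Increment(ref _counter);

            // Monitor.Enter/Exit (the lock statement) provide acquire/release
            // semantics around the protected region.
            lock (_gate)
            {
                _counter++;
            }
        }
    }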
