
Alignment requirements for atomic x86 instructions vs. Microsoft's InterlockedCompareExchange documentation?

Posted on 2024-08-05 00:39:00

Microsoft offers the InterlockedCompareExchange function for performing atomic compare-and-swap operations. There is also an _InterlockedCompareExchange intrinsic.

On x86, these are implemented using the lock cmpxchg instruction.
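
For illustration, here is a minimal portable sketch of the same compare-and-swap semantics using C11 atomics rather than the Windows API. The name sketch_compare_exchange is hypothetical; it only mimics the documented behaviour of returning the value the destination held before the operation. On x86, compilers typically lower atomic_compare_exchange_strong to lock cmpxchg, but that is a property of the compiler, not something this code guarantees.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for InterlockedCompareExchange, built on C11 atomics.
 * Returns the value *dest held before the operation. */
int32_t sketch_compare_exchange(_Atomic int32_t *dest,
                                int32_t exchange,
                                int32_t comparand)
{
    assert(dest != NULL);
    int32_t expected = comparand;
    /* On success, *dest becomes exchange and expected is left unchanged.
     * On failure, expected is overwritten with the current value of *dest.
     * Either way, expected ends up holding the value seen in *dest. */
    atomic_compare_exchange_strong(dest, &expected, exchange);
    return expected;
}
```

The swap took place exactly when the return value equals the comparand.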

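However, reading through the documentation for these three approaches, they don't seem to agree on the alignment requirements.

Intel's reference manual says nothing about alignment (other than that an exception is generated if alignment checking is enabled and an unaligned memory reference is made).

I also looked up the lock prefix, whose documentation specifically states

The integrity of the LOCK prefix is not affected by the alignment of the memory field.

(emphasis mine)

So Intel seems to say that alignment is irrelevant. The operation will be atomic no matter what.

The _InterlockedCompareExchange intrinsic documentation also says nothing about alignment, but the InterlockedCompareExchange function documentation states that

The parameters for this function must be aligned on a 32-bit boundary; otherwise, the function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.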
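
As a hedged illustration of how a destination can end up off a 32-bit boundary in the first place, the sketch below compares a packed struct with a normally padded one. The struct names are hypothetical, and the packed attribute is GCC/Clang syntax (MSVC would use #pragma pack instead).

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Packed layout: no padding after tag, so value sits at offset 1
 * and is not on a 32-bit boundary. (GCC/Clang attribute syntax.) */
struct __attribute__((packed)) packed_record {
    uint8_t tag;
    int32_t value;
};

/* Explicitly aligned layout: value is placed on a 4-byte boundary,
 * which is what the Microsoft documentation asks for. */
struct padded_record {
    uint8_t tag;
    alignas(4) int32_t value;
};
```

Passing the address of packed_record.value to a 32-bit interlocked operation would be the kind of misaligned use the quoted requirement rules out; padded_record.value satisfies it as long as the struct itself is placed at a 4-byte-aligned address.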

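So what gives? Are the alignment requirements for InterlockedCompareExchange just there to make sure the function works even on pre-486 CPUs where the cmpxchg instruction isn't available? That seems likely based on the information above, but I'd like to be sure before I rely on it. :)

Or does the ISA require alignment to guarantee atomicity, and I'm just looking in the wrong places in Intel's reference manuals?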


一刻暧昧 2024-08-12 00:39:00

x86 does not require alignment for a lock cmpxchg instruction to be atomic. However, alignment is necessary for good performance.

This should be no surprise: backward compatibility means that software written with a manual from 14 years ago will still run on today's processors. Modern CPUs even have a performance counter specifically for split-lock detection because split locks are so expensive. (The core can't just hold onto exclusive access to a single cache line for the duration of the operation; it has to do something like a traditional bus lock.)

Why exactly Microsoft documents an alignment requirement is not clear. It's certainly necessary for supporting RISC architectures, but the specific claim of unpredictable behaviour on multiprocessor x86 might not even be valid. (Unless they mean unpredictable performance, rather than a correctness problem.)

Your guess of applying only to pre-486 systems without lock cmpxchg might be right; a different mechanism would be needed there which might have required some kind of locking around pure loads or pure stores. (Also note that 486 cmpxchg has a different and currently-undocumented opcode (0f a7) from modern cmpxchg (0f b1) which was new with 586 Pentium; Windows might have only used cmpxchg on P5 Pentium and later, I don't know.) That could maybe explain weirdness on some x86, without implying weirdness on modern x86.

Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3 (3A): System Programming Guide
January 2013
8.1.2.2 Software Controlled Bus Locking
To explicitly force the LOCK semantics, software can use the LOCK prefix with the following instructions when they are used to modify a memory location. [...]
• The exchange instructions (XADD, CMPXCHG, and CMPXCHG8B).
• The LOCK prefix is automatically assumed for XCHG instruction.
• [...]
[...] The integrity of a bus lock is not affected by the alignment of the
memory field. The LOCK semantics are followed for as many bus cycles
as necessary to update the entire operand. However, it is recommend
that locked accesses be aligned on their natural boundaries for better
system performance:
• Any boundary for an 8-bit access (locked or otherwise).
• 16-bit boundary for locked word accesses.
• 32-bit boundary for locked doubleword accesses.
• 64-bit boundary for locked quadword accesses.
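
The natural-boundary recommendations above boil down to "the address is a multiple of the operand size". A minimal sketch of that check follows; is_naturally_aligned is a hypothetical helper, not part of any Intel or Microsoft API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when addr is a multiple of size.
 * size must be a power of two: 1, 2, 4 or 8 for the byte, word,
 * doubleword and quadword accesses in the list above. */
bool is_naturally_aligned(uintptr_t addr, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    /* For a power-of-two size, the low bits of an aligned address are zero. */
    return (addr & (size - 1)) == 0;
}
```

A debug build could call this on the destination pointer (cast to uintptr_t) before a locked operation to catch accidental misalignment, which costs performance on x86 even though it does not break atomicity.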

Fun fact: cmpxchg without a lock prefix is still atomic wrt. context switches, so is usable for multi-threading on a single-core system.

Even misaligned it's still atomic wrt. interrupts (either completely before or completely after), and only memory reads by other devices (e.g. DMA) could see tearing. But such accesses could also see the separation between load and store, so even if old Windows did use that for a more efficient InterlockedCompareExchange on single-core systems, it still wouldn't require alignment for correctness, only performance. If this can be used for hardware access, Windows probably wouldn't do that.

If the library function needed to do a pure load separate from the lock cmpxchg this might make sense, but it doesn't need to do that. (If not inlined, the 32-bit version would have to load its args from the stack, but that's private, not access to the shared variable.)


南街女流氓 2024-08-12 00:39:00

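The PDF you are quoting from is from 1999 and is clearly outdated.

The up-to-date Intel documentation, specifically Volume 3A, tells a different story.

For example, on a Core i7 processor you still have to make sure your data does not span cache lines, or else the operation is not guaranteed to be atomic.

In Volume 3A, System Programming, for x86/x64, Intel clearly states:

8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.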
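
Whether an access "fits within a cache line" can be checked with simple address arithmetic. The sketch below assumes a 64-byte line, which is common on recent x86 processors but is an assumption here, not something the quoted text states; splits_cache_line and SKETCH_CACHE_LINE are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes. Real code should query the CPU
 * instead of hard-coding this value. */
#define SKETCH_CACHE_LINE 64u

/* Returns true when the size bytes starting at addr straddle two
 * cache lines, i.e. the first and last byte fall in different lines. */
bool splits_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= SKETCH_CACHE_LINE);
    uintptr_t first_line = addr / SKETCH_CACHE_LINE;
    uintptr_t last_line = (addr + size - 1) / SKETCH_CACHE_LINE;
    return first_line != last_line;
}
```

Note that a naturally aligned 1-, 2-, 4- or 8-byte operand can never split a line under this assumption, because the line size is a multiple of every such operand size.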


独木成林 2024-08-12 00:39:00

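See this SO question: natural alignment is important for performance, and it is required on the x64 architecture (so it's not just PRE-x86 systems but POST-x86 ones too; x64 may still be a bit of a niche case, but it's growing in popularity after all ;-). That may be why Microsoft documents it as required. It is hard to find documentation on whether Microsoft has decided to FORCE the alignment issue by enabling alignment checking, and that may vary by Windows version. By stating in the docs that alignment is required, Microsoft keeps the freedom to enforce it in some versions of Windows even if it does not enforce it in others.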
