Which CPU architectures support compare-and-swap (CAS)?

Posted 2024-07-05 16:22:44

Just wondering which CPU architectures support the compare-and-swap atomic primitive?

情丝乱 2024-07-12 16:22:45

Sorry for the wall of text. :(

Almost all instructions in the x86 ISA (except the so-called string instructions, and maybe a few others), including CMPXCHG, are atomic in the context of a single-core CPU. This is because, according to the x86 architecture, the CPU checks for pending interrupts after each instruction completes and never in the middle of one. As a result, an interrupt request can be detected, and its handling launched, only on the boundary between two consecutive instructions. Because of this, all memory references made by the CPU during the execution of a single instruction are isolated and cannot be interleaved with any other activity. That behavior is common to single-core and multicore CPUs. But whereas a single-core system has only one unit that accesses memory, a multicore system has more than one unit accessing memory simultaneously. Instruction isolation isn't enough for consistency in such an environment, because memory accesses made by different CPUs at the same time can interleave with each other. Because of this, an additional protection layer must be applied to the data-changing protocol. For x86 this layer is the lock prefix, which initiates an atomic transaction on the system bus.

Summary: It is safe and less costly to use sync instructions like CMPXCHG, XADD, BTS, etc. without the lock prefix if you are sure that the data accessed by the instruction can be accessed by only one core. If you are not sure of this, apply the lock prefix to gain safety at the cost of performance.

There are two major approaches to hardware synchronization support in CPUs:

Atomic transaction based.
Cache coherence protocol based.

Neither is a silver bullet. Both approaches have their advantages and disadvantages.

The atomic-transaction-based approach relies on support for a special type of transaction on the memory bus. During such a transaction only one agent (CPU core) connected to the bus is eligible to access memory. As a result, on the one hand, all memory references made by the bus owner during the atomic transaction are guaranteed to happen as a single uninterruptible transaction. On the other hand, all other bus agents (CPU cores) are forced to wait for the atomic transaction to complete before they can access memory again. It doesn't matter which memory cells they want to access, even if it is a memory region the bus owner does not reference during the atomic transaction. As a result, extensive use of lock-prefixed instructions will slow the system down significantly. At the same time, because the bus arbiter grants bus access to each agent according to round-robin scheduling, each bus agent is guaranteed relatively fair access to memory, and all agents are able to make progress at the same speed. In addition, the ABA problem comes into play with atomic transactions, because atomic transactions are by nature very short (a few memory references made by a single instruction), and all actions taken on memory during a transaction rely only on the value of the memory region, without taking into account whether that memory region was accessed by someone else between two transactions.
A good example of atomic-transaction-based sync support is the x86 architecture, in which lock-prefixed instructions force the CPU to execute them in atomic transactions.
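
The ABA point can be illustrated with a small sketch in C11 atomics. It is not tied to any particular CPU, and the helper name cas_misses_aba is made up for this illustration. The two "threads" are simulated sequentially in one function so the outcome is deterministic: the value goes A, then B, then A again, and the compare-and-swap still succeeds because it compares values only.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: returns true if a CAS that expects `seen` still
 * succeeds after the cell has gone seen -> other -> seen in between. */
bool cas_misses_aba(int seen, int other, int desired)
{
    atomic_int cell;
    atomic_init(&cell, seen);

    int snapshot = atomic_load(&cell);   /* "thread 1" reads A */

    atomic_store(&cell, other);          /* "thread 2" writes B ...      */
    atomic_store(&cell, seen);           /* ... and then writes A again  */

    /* The CAS only compares the current value with the snapshot, so the
     * intermediate write is invisible to it. */
    return atomic_compare_exchange_strong(&cell, &snapshot, desired);
}
```

Calling cas_misses_aba(1, 2, 3) returns true: the swap goes through even though the cell was modified twice after the snapshot was taken.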

The cache-coherence-protocol-based approach relies on the fact that a memory line can be cached in only one L1 cache at any instant. The memory access protocol in a cache-coherent system is similar to the following sequence of actions:

CPU A holds memory line X in its L1 cache. At the same time, CPU B wants to access memory line X. (X --> CPU A L1)
CPU B issues an access transaction for memory line X on the bus. (X --> CPU A L1)
Every bus agent (CPU core) has a so-called snooping agent that listens to all transactions on the bus and checks whether the memory line requested by a transaction is stored in its own CPU's L1 cache. So CPU A's snooping agent detects that CPU A owns the memory line requested by CPU B. (X --> CPU A L1)
CPU A suspends the memory access transaction issued by CPU B. (X --> CPU A L1)
CPU A flushes the memory line requested by B from its L1 cache. (X --> memory)
CPU A resumes the previously suspended transaction. (X --> memory)
CPU B fetches memory line X from memory. (X --> CPU B L1)
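
The sequence above can be mimicked with a toy model in C. This is only a sketch of the simplified protocol as described here, not of any real coherence protocol; the struct and function names are invented for the illustration, and the model assumes a single line that lives in at most one L1 cache at a time.

```c
enum { NO_OWNER = -1 };

/* Toy model of one memory line: a copy in memory plus at most one L1 copy. */
struct line {
    int memory_value;   /* copy in main memory (may be stale while owned) */
    int cached_value;   /* copy in the owner's L1 cache                   */
    int owner;          /* index of the CPU holding the line, or NO_OWNER */
};

/* `cpu` requests the line: the current owner (if any) flushes it to memory,
 * then the requester fetches it from memory into its own L1. */
static void acquire_line(struct line *l, int cpu)
{
    if (l->owner == cpu)
        return;                               /* already in our L1 */
    if (l->owner != NO_OWNER)
        l->memory_value = l->cached_value;    /* owner flushes: X --> memory */
    l->cached_value = l->memory_value;        /* requester fetches X */
    l->owner = cpu;                           /* X --> requester's L1 */
}

int read_line(struct line *l, int cpu)
{
    acquire_line(l, cpu);
    return l->cached_value;
}

void write_line(struct line *l, int cpu, int value)
{
    acquire_line(l, cpu);
    l->cached_value = value;                  /* memory stays stale until a flush */
}
```

If CPU 0 writes 42 and CPU 1 then reads the line, the read forces the flush, CPU 1 sees 42, and ownership moves from CPU 0 to CPU 1.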

Thanks to that protocol, a CPU core always accesses the actual data in memory, and accesses to memory are serialized in strict order, one access at a time.
Cache-coherence-based sync support relies on the fact that the CPU can easily detect that a particular memory line was accessed between two points in time. During the first memory access to line X, the one that opens the transaction, the CPU can mark that memory line in the L1 cache as one that must be watched by the snooping agent. The snooping agent, in turn, can perform an additional check during a cache line flush to see whether the line is marked as watched, and raise an internal flag if a watched line is flushed. As a result, if the CPU checks the internal flag during the memory access that closes the transaction, it will know whether the watched memory line could have been changed by someone else, and can conclude whether the transaction should complete successfully or be considered failed.
This is how the LL/SC class of instructions is implemented. This approach is simpler than atomic transactions and provides much more flexibility in synchronization, because many more kinds of sync primitives can be built on top of it than on top of the atomic transaction approach. It is also more scalable and efficient, because it doesn't block memory access for all other parts of the system. And as you can see, it solves the ABA problem, because it is based on detecting access to a memory region rather than on detecting a change in the region's value. Any access to a memory region participating in an ongoing transaction is treated as a transaction failure. This can be good and bad at the same time, because a particular algorithm may be interested only in the value of the memory region and not care whether the location was accessed by someone in the meantime, as long as that access did not change the memory. In that case a read of the memory value in the middle leads to a spurious transaction failure. In addition, this approach can cause huge performance degradation for control flows contending for the same memory line, because they can constantly steal the memory line from each other and thereby prevent each other from completing the transaction successfully. That is a really significant problem, because in the extreme case it can drive the system into livelock.
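
A toy model of that behavior in C, under the assumptions of this answer: a single memory cell, one reservation at a time, and any access by another CPU (even a read) dropping the reservation. Real LL/SC implementations differ in detail, and all names below are invented for the illustration.

```c
#include <stdbool.h>

/* Toy model: one value plus one reservation (the "watched" mark). */
struct monitored_cell {
    int value;
    int reserved_by;    /* CPU index holding the reservation, or -1 */
};

/* Opens the transaction: read the value and mark the line as watched. */
int load_linked(struct monitored_cell *c, int cpu)
{
    c->reserved_by = cpu;
    return c->value;
}

/* Ordinary accesses: an access by any other CPU takes the line away
 * and therefore drops the reservation. */
int plain_load(struct monitored_cell *c, int cpu)
{
    if (c->reserved_by != cpu)
        c->reserved_by = -1;
    return c->value;
}

void plain_store(struct monitored_cell *c, int cpu, int value)
{
    if (c->reserved_by != cpu)
        c->reserved_by = -1;
    c->value = value;
}

/* Closes the transaction: store only if the reservation survived. */
bool store_conditional(struct monitored_cell *c, int cpu, int value)
{
    if (c->reserved_by != cpu)
        return false;               /* line was touched: transaction failed */
    c->value = value;
    c->reserved_by = -1;
    return true;
}
```

In this model, if CPU 1 writes B and then A again between CPU 0's load_linked and store_conditional, the store fails even though the value looks unchanged, which is the ABA case a value-comparing CAS cannot see. A mere plain_load by CPU 1 also makes the store fail, which is the spurious failure mentioned above.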
Cache-coherence-based sync support is usually used in RISC CPUs because of its simplicity and flexibility. But it must be noted that Intel decided to support this approach to synchronization in the x86 architecture too. Last year Intel announced the Transactional Synchronization Extensions to the x86 architecture, to be implemented in the Haswell generation of Intel processors. As a result, it looks like x86 will have the most powerful synchronization support and will allow system developers to use the advantages of both approaches.

老街孤人 2024-07-12 16:22:45

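A few people commented/asked whether the "lock" prefix is needed on x86/x64 for cmpxchg. The answer is yes for multicore machines. The instruction is completely atomic on single-core machines without lock.

It's been a while since I studied this stuff that deeply, but I seem to remember that the instruction is technically restartable: the CPU can abort the instruction mid-flight (if it hasn't had any side effects yet) to avoid delaying interrupt handling for too long.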

樱花坊 2024-07-12 16:22:45

Intel x86 has this support. IBM, in its Solaris to Linux Porting Guide, gives this example:

bool_t My_CompareAndSwap(IN int *ptr, IN int old, IN int new)
{
        unsigned char ret;

        /* Note that sete sets a 'byte' not the word */
        __asm__ __volatile__ (
                "  lock\n"
                "  cmpxchgl %2,%1\n"
                "  sete %0\n"
                : "=q" (ret), "=m" (*ptr)
                : "r" (new), "m" (*ptr), "a" (old)
                : "memory");

        return ret;
}
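
For comparison, here is a minimal portable sketch of the same operation using C11 <stdatomic.h> instead of inline assembly. It is not from the IBM guide, and the function name compare_and_swap is made up here. On x86, GCC and Clang typically compile atomic_compare_exchange_strong on an int to a lock cmpxchg instruction, but that is a property of the compiler and target, not something the C standard promises.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical portable counterpart of My_CompareAndSwap: store `desired`
 * into *ptr only if *ptr currently equals `expected`. Returns true when
 * the swap happened. */
bool compare_and_swap(atomic_int *ptr, int expected, int desired)
{
    /* On failure the library writes the current value into `expected`;
     * it is a by-value parameter here, so that update is simply dropped. */
    return atomic_compare_exchange_strong(ptr, &expected, desired);
}
```

With a cell holding 5, compare_and_swap(&cell, 5, 7) returns true and leaves 7 in the cell; a second call that still expects 5 returns false and leaves the cell untouched.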

时间海 2024-07-12 16:22:45

Starting with the ARMv6 architecture, ARM has the LDREX/STREX instructions, which can be used to implement an atomic compare-exchange operation.
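
As a sketch of what that looks like from C, the loop below builds an "atomic maximum" out of a compare-exchange. The function name atomic_store_max is invented for this example. The weak form of the C11 compare-exchange is allowed to fail spuriously, which matches the way a STREX can fail, and on ARM targets with these instructions compilers typically lower such a loop to an LDREX/STREX retry sequence (an assumption about common toolchains, not a guarantee).

```c
#include <stdatomic.h>

/* Atomically raise *target to at least `candidate`.
 * Returns the value observed before the update (or the current value if
 * no update was needed). */
int atomic_store_max(atomic_int *target, int candidate)
{
    int current = atomic_load(target);
    while (current < candidate &&
           !atomic_compare_exchange_weak(target, &current, candidate)) {
        /* The failed exchange refreshed `current` with the latest value;
         * the loop condition re-checks it and retries if still needed. */
    }
    return current;
}
```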

黑色毁心梦 2024-07-12 16:22:45

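Just to complete the list, MIPS has Load Linked (ll) and Store Conditional (sc) instructions, which load a value from memory and later conditionally store if no other CPU has accessed the location. It's true that you can use these instructions to perform swap, increment, and other operations. The disadvantage, however, is that with a large number of CPUs exercising locks very heavily you get into livelock: the conditional store will frequently fail and necessitate another loop to try again, which will fail, and so on.

The software mutex_lock implementation can become very complicated trying to implement exponential backoff if these situations are considered important enough to worry about. In one system I worked on, with 128 cores, they were.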
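
A minimal sketch of the exponential backoff idea in C11, assuming a simple spinlock built on compare-exchange. The type, function names, and backoff limits are illustrative choices, not taken from any particular mutex_lock implementation, and a production lock would also want a CPU pause hint or a yield to the scheduler.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } backoff_lock;   /* hypothetical type */

enum { BACKOFF_MIN = 4, BACKOFF_MAX = 1024 };           /* illustrative limits */

void backoff_lock_acquire(backoff_lock *l)
{
    unsigned delay = BACKOFF_MIN;
    for (;;) {
        bool expected = false;
        /* Try to flip the flag from false to true. */
        if (atomic_compare_exchange_weak_explicit(&l->locked, &expected, true,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return;
        /* Failed: stay off the contended line for `delay` iterations. */
        for (volatile unsigned i = 0; i < delay; i++) { }
        /* Double the wait after every failure, up to a cap. */
        if (delay < BACKOFF_MAX)
            delay *= 2;
    }
}

void backoff_lock_release(backoff_lock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

The point of the growing delay is that a waiter that keeps losing makes fewer and fewer attempts, so the winner's conditional store has a better chance of completing instead of being knocked out again.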
