What is the cost of atomic operations?

Posted on 2024-08-26 23:01:05

What is the cost of an atomic operation (any of compare-and-swap or atomic add/decrement)? How many cycles does it consume? Will it pause other processors on SMP or NUMA, or will it block memory accesses?
Will it flush the reorder buffer in an out-of-order CPU?

What effects will it have on the cache?

I'm interested in modern, popular CPUs: x86, x86_64, PowerPC, SPARC, Itanium.
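
For concreteness, here is a minimal C++11 sketch of the operations in question (the function names are illustrative only); on x86/x86_64 compilers typically emit lock xadd for the fetch_add and lock cmpxchg for the compare-exchange, while PowerPC, SPARC and Itanium map them to LL/SC loops or their own cas/cmpxchg instructions:

#include <atomic>

std::atomic<long> counter{0};

// Atomic add: returns the previous value; `lock xadd` on x86/x86_64.
long atomic_increment() {
    return counter.fetch_add(1, std::memory_order_seq_cst);
}

// Compare-and-swap: succeeds only if `counter` still holds `expected`;
// `lock cmpxchg` on x86/x86_64.
bool atomic_cas(long expected, long desired) {
    return counter.compare_exchange_strong(expected, desired);
}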

Comments (4)

梦境 2024-09-02 23:01:05

I have looked for actual data for the past days, and found nothing.
However, I did some research, which compares the cost of atomic ops with the costs of cache misses.

The cost of the x86 LOCK prefix (including lock cmpxchg for atomic CAS) before PentiumPro (as described in the doc) is a memory access (like a cache miss), + stopping memory operations by other processors, + any contention with other processors trying to LOCK the bus. However, since PentiumPro, for normal Writeback cacheable memory (all memory an app deals with, unless you talk directly with hardware), instead of blocking all memory operations, only the relevant cache line is blocked (based on the link in @osgx's answer).

i.e. the core delays answering MESI share and RFO requests for the line until after the store part of the actual locked operation. This is called a "cache lock", and only affects that one cache line. Other cores can be loading / storing or even CASing other lines at the same time.
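
A quick way to see that the "cache lock" is per cacheline rather than per bus is to let two threads hammer two atomic counters padded onto separate cache lines; the following is a rough C++ sketch under the assumption of 64-byte lines (names are illustrative), not a rigorous benchmark:

#include <atomic>
#include <thread>

// Each counter sits on its own cache line (assuming 64-byte lines),
// so the two threads' locked operations never touch the same line.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter counters[2];

void hammer(std::atomic<long>& c, long iterations) {
    for (long i = 0; i < iterations; ++i)
        c.fetch_add(1, std::memory_order_relaxed);   // lock xadd on x86
}

int main() {
    std::thread t0(hammer, std::ref(counters[0].value), 10000000L);
    std::thread t1(hammer, std::ref(counters[1].value), 10000000L);
    t0.join();
    t1.join();
}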


Actually, the CAS case can be more complicated, as explained on this page, with no timings but an insightful description by a trustworthy engineer. (At least for the normal use-case where you do a pure load before the actual CAS.)

Before going into too much detail, I'll say that a LOCKed operation costs one cache miss + the possible contention with other processors on the same cacheline, while CAS + the preceding load (which is almost always required, except on mutexes, where you always CAS between 0 and 1) can cost two cache misses.

He explains that a load + CAS on a single location can actually cost two cache misses, like Load-Linked/Store-Conditional (see there for the latter). His explanation relies on knowledge of the MESI cache coherence protocol. It uses 4 states for a cacheline: M(odified), E(xclusive), S(hared), I(nvalid) (hence the name MESI), explained below where needed. The scenario, explained, is the following:

  • the LOAD causes a cache miss - the relevant cacheline is loaded from memory in Shared state (i.e. other processors are still allowed to keep that cacheline in their caches; no changes are allowed in this state). If the location is already in this CPU's cache, this cache miss is skipped. Possible cost: 1 cache miss. (skipped if the cacheline is in Shared, Exclusive or Modified state, i.e. the data is already in this CPU's L1 cache).
  • the program calculates the new values to store,
  • and it runs an atomic CAS instruction.
    • It has to avoid concurrent modification, so it must remove copies of the cacheline from the cache of other CPUs, to move the cacheline to the Exclusive state. Possible cost: 1 cache miss. This is not needed if it is already exclusively owned, i.e. in the Exclusive or Modified state. In both states, no other CPUs hold the cacheline, but in the Exclusive state it has not been modified (yet).
    • After this communication, the variable is modified in our CPU's local cache, at which point it is globally visible to all other CPUs (because their caches are coherent with ours). It will eventually be written to main memory according to usual algorithms.
    • Other processors trying to read or modify that variable will first have to obtain that cacheline in Shared or Exclusive mode, and to do so will contact this processor and receive the updated version of the cacheline.
      A LOCKed operation, instead, costs at most one cache miss (because the cacheline will be requested directly in Exclusive state).

In all cases, a cacheline request can be stalled by other processors already modifying the data.
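
The scenario above corresponds to the usual load-then-CAS update loop; a hedged C++ sketch of that pattern (illustrative names) looks like this:

#include <atomic>

std::atomic<long> value{0};

// The initial load() may fetch the line in Shared state (first potential miss);
// the CAS then needs the line in Exclusive state (second potential miss / RFO).
// An unconditional fetch_add, by contrast, requests the line Exclusive directly.
void add_via_cas(long delta) {
    long old = value.load(std::memory_order_relaxed);          // plain load first
    while (!value.compare_exchange_weak(old, old + delta)) {
        // on failure, `old` has been reloaded with the current value; retry
    }
}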

桜花祭 2024-09-02 23:01:05

I did some profiling with the following setup: The test machine (AMD Athlon64 x2 3800+) was booted, switched to long mode (interrupts disabled), and the instruction of interest was executed in a loop, 100 iterations unrolled and 1,000 loop cycles. The loop body was aligned to 16 bytes. The time was measured with an rdtsc instruction before and after the loop. Additionally, a dummy loop without any instruction was executed (which measured 2 cycles per loop iteration and 14 cycles for the rest) and the result was subtracted from the result of the instruction profiling time.

The following instructions were measured:

  • "lock cmpxchg [rsp - 8], rdx" (both with comparison match and mismatch),
  • "lock xadd [rsp - 8], rdx",
  • "lock bts qword ptr [rsp - 8], 1"

In all cases the time measured was about 310 cycles; the error was about +/- 8 cycles.

This is the value for repeated execution on the same (cached) memory. With an additional cache miss, the times are considerably higher. Also, this was done with only one of the 2 cores active, so the cache was exclusively owned and no cache synchronisation was required.

To evaluate the cost of a locked instruction on a cache miss, I added a wbinvd instruction before the locked instruction and put the wbinvd plus an add [rsp - 8], rax into the comparison loop. In both cases the cost was about 80,000 cycles per instruction pair! In the case of lock bts the time difference was about 180 cycles per instruction.

Note that this is the reciprocal throughput, but since locked operations are serializing operations, there is probably no difference in latency.

Conclusion: a locked operation is heavy, but a cache miss can be much heavier.
Also: a locked operation does not cause cache misses. It can only cause cache synchronisation traffic when the cacheline is not owned exclusively.

To boot the machine, I used an x64 version of FreeLdr from the ReactOS project. Here's the asm source code:

#define LOOP_COUNT 1000
#define UNROLLED_COUNT 100

PUBLIC ProfileDummy
ProfileDummy:

    cli

    // Get current TSC value into r8
    rdtsc
    mov r8, rdx
    shl r8, 32
    or r8, rax

    mov rcx, LOOP_COUNT
    jmp looper1

.align 16
looper1:

REPEAT UNROLLED_COUNT
    // nothing, or add something to compare against
ENDR

    dec rcx
    jnz looper1

    // Put new TSC minus old TSC into rax
    rdtsc
    shl rdx, 32
    or rax, rdx
    sub rax, r8

    ret

PUBLIC ProfileFunction
ProfileFunction:

    cli

    // Get current TSC value into r8
    rdtsc
    mov r8, rdx
    shl r8, 32
    or r8, rax
    mov rcx, LOOP_COUNT

    jmp looper2

.align 16
looper2:

REPEAT UNROLLED_COUNT
    // Put here the code you want to profile
    // make sure it doesn't mess up non-volatiles or r8
    lock bts qword ptr [rsp - 8], 1
ENDR

    dec rcx
    jnz looper2

    // Put new TSC minus old TSC into rax
    rdtsc
    shl rdx, 32
    or rax, rdx
    sub rax, r8

    ret
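
The measurement above runs bare-metal with interrupts disabled; a rough user-space approximation (a sketch only, and noisier because interrupts and frequency scaling are not controlled) can be built around the __rdtsc() intrinsic on GCC/Clang:

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc()

static std::atomic<long> target{0};

int main() {
    const long kIters = 1000000;

    target.fetch_add(0);                      // warm the cache line first

    uint64_t start = __rdtsc();
    for (long i = 0; i < kIters; ++i)
        target.fetch_add(1, std::memory_order_seq_cst);   // lock xadd on x86_64
    uint64_t cycles = __rdtsc() - start;

    // These are reference (TSC) cycles, not core clock cycles, so treat the
    // number as a rough figure rather than an exact per-instruction cost.
    std::printf("~%.1f reference cycles per locked add\n",
                (double)cycles / kIters);
    return 0;
}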

浸婚纱 2024-09-02 23:01:05

On bus-based SMP, the atomic prefix LOCK does assert (turn on) a bus wire signal, LOCK#. It prohibits other CPUs/devices on the bus from using it.

Ppro & P2 book http://books.google.com/books?id=3gDmyIYvFH4C&pg=PA245&dq=lock+instruction+pentium&lr=&ei=_E61S5ehLI78zQSzrqwI&cd=1#v=onepage&q=lock%20instruction%20pentium&f=false pages 244-246

Locked instructions are serializing, synchronizing operations ....
/about Out-of-order/ locked RMW/read-modify-write = atomic itself/ instruction ensures that the processor will execute all instructions before the locked instruction prior to executing it.
/about yet not flushed writes/ it forces all posted writes within the processor to be flushed to external memory before executing the next instruction.

/about SMP/ semaphore is in cache in S state... issuing a read and invalidate transaction for 0 bytes of data (this is a kill of the shared copies of the cache line in adjacent CPUs)

生生漫 2024-09-02 23:01:05

The overhead of atomic operations is usually heavy and is determined by multiple factors, including the CPU implementation, the degree of concurrency (threads/cores), and the kind of atomic operation (Read/Write/FADD/CAS).

I found a simple benchmark tool that you can run on your own device:
https://github.com/wichtounet/articles/blob/master/src/threads/benchmark/bench.cpp

I extended the code to compare non-atomic FADD (please ignore correctness) and atomic FADD, and tested it on an Intel i7-7700 CPU.

Benchmark with a single thread: it shows that atomic FADD is about 3 times slower in this case.

bench_nonatomic_fetch_add with 1 threads throughput = 650002 ops/ms
bench_atomic_fetch_add_relaxed with 1 threads throughput = 199457 ops/ms

Benchmark with 2 threads: it shows that atomic FADD is about 12.7 times slower in this case.

bench_nonatomic_fetch_add with 2 threads throughput = 871079 ops/ms
bench_atomic_fetch_add_relaxed with 2 threads throughput = 68421 ops/ms

Note that when writing such benchmark code, you should not involve memory barriers. Barriers also slow down your code, but atomic operations and memory barriers are orthogonal concepts.
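
The bench_nonatomic_fetch_add / bench_atomic_fetch_add_relaxed functions in the output above come from the extension of bench.cpp, which is not shown here; the following is only a stripped-down sketch of that kind of comparison (a deliberately racy plain increment as the baseline versus a relaxed atomic fetch_add):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

static long plain_counter = 0;                 // deliberately racy baseline
static std::atomic<long> atomic_counter{0};

// Run `op` in `threads` threads, `iters` times each; return ops per millisecond.
template <typename Fn>
double ops_per_ms(int threads, long iters, Fn op) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] { for (long i = 0; i < iters; ++i) op(); });
    for (auto& th : pool)
        th.join();
    std::chrono::duration<double, std::milli> ms =
        std::chrono::steady_clock::now() - start;
    return threads * iters / ms.count();
}

int main() {
    const long iters = 10000000;
    for (int threads : {1, 2}) {
        std::printf("nonatomic fetch_add, %d thread(s): %.0f ops/ms\n", threads,
                    ops_per_ms(threads, iters, [] { ++plain_counter; }));
        std::printf("atomic fetch_add (relaxed), %d thread(s): %.0f ops/ms\n", threads,
                    ops_per_ms(threads, iters, [] {
                        atomic_counter.fetch_add(1, std::memory_order_relaxed);
                    }));
    }
    return 0;
}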
