What is the cost of an atomic operation?
What is the cost of an atomic operation (any compare-and-swap or atomic add/decrement)? How many cycles does it consume? Does it stall other processors on SMP or NUMA, or does it block memory accesses? Does it flush the reorder buffer in an out-of-order CPU?
What effects does it have on the cache?
I'm interested in modern, popular CPUs: x86, x86_64, PowerPC, SPARC, Itanium.
4 Answers
I have looked for actual data for the past few days and found nothing. However, I did do some research comparing the cost of atomic ops with the cost of cache misses.
Before PentiumPro, the cost of the x86 LOCK prefix (including lock cmpxchg for atomic CAS), as described in the doc, was a memory access (like a cache miss), plus stopping memory operations by other processors, plus any contention with other processors trying to LOCK the bus. Since PentiumPro, however, for normal Writeback cacheable memory (all memory an app deals with, unless you talk directly to hardware), instead of blocking all memory operations, only the relevant cache line is blocked (based on the link in @osgx's answer). That is, the core delays answering MESI share and RFO requests for that line until after the store part of the actual LOCKed operation. This is called a "cache lock", and it only affects that one cache line; other cores can be loading/storing or even CASing other lines at the same time.
Actually, the CAS case can be more complicated, as explained on this page, with no timings but an insightful description by a trustworthy engineer. (At least for the normal use-case where you do a pure load before the actual CAS.)
Before going into too much detail, I'll say that a LOCKed operation costs one cache miss plus possible contention with other processors on the same cache line, while CAS plus the preceding load (which is almost always required, except on mutexes, where you always CAS 0 and 1) can cost two cache misses.
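As a concrete illustration of the mutex remark (CAS between 0 and 1, no preceding load needed), here is a minimal C++ sketch; the class name and structure are my own, not code from any of the sources discussed:

    #include <atomic>

    // Minimal test-and-set spinlock: the CAS always tries to go 0 -> 1,
    // so no separate load of the current value is needed beforehand.
    class SpinLock {
        std::atomic<int> state_{0};          // 0 = unlocked, 1 = locked
    public:
        void lock() {
            int expected = 0;
            // compare_exchange_weak compiles to "lock cmpxchg" on x86-64;
            // on failure it overwrites `expected` with the observed value,
            // so reset it to 0 before retrying.
            while (!state_.compare_exchange_weak(expected, 1,
                                                 std::memory_order_acquire)) {
                expected = 0;
            }
        }
        void unlock() { state_.store(0, std::memory_order_release); }
    };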
He explains (on the page linked above) that a load + CAS on a single location can actually cost two cache misses, like Load-Linked/Store-Conditional (see there for the latter). His explanation relies on knowledge of the MESI cache coherence protocol, which uses 4 states for a cache line: M(odified), E(xclusive), S(hared), I(nvalid) (hence the name MESI). The scenario is roughly this: the preceding load fetches the cache line, typically in Shared state (one cache miss if it was not already cached); the CAS then needs the line in Exclusive/Modified state in order to write it, so it must request ownership and invalidate the other copies, which costs about as much as a second cache miss.
A LOCKed operation, instead, can only cost a cache miss (because the cacheline will be requested directly in Exclusive state).
In all cases, a cacheline request can be stalled by other processors already modifying the data.
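For contrast with the spinlock above, the load + CAS pattern that can cost two cache misses is the usual retry loop used for arbitrary read-modify-write updates. A minimal C++ sketch (my own example): the initial load may bring the cache line in Shared state, and the CAS then has to upgrade it to Exclusive/Modified.

    #include <atomic>

    // Atomically raise `value` to at least `candidate`; returns the old value.
    // The leading load is the "pure load before the actual CAS"; the
    // compare_exchange then needs ownership of the line to write it.
    int atomic_max(std::atomic<int>& value, int candidate) {
        int observed = value.load(std::memory_order_relaxed);
        while (observed < candidate &&
               !value.compare_exchange_weak(observed, candidate,
                                            std::memory_order_acq_rel,
                                            std::memory_order_relaxed)) {
            // on failure, `observed` is refreshed with the current value
        }
        return observed;
    }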
I did some profiling with the following setup: the test machine (AMD Athlon64 x2 3800+) was booted, switched to long mode (interrupts disabled), and the instruction of interest was executed in a loop, unrolled 100 times, for 1,000 loop cycles. The loop body was aligned to 16 bytes. The time was measured with an rdtsc instruction before and after the loop. Additionally, a dummy loop without the instruction was executed (which measured 2 cycles per loop iteration and 14 cycles for the rest), and that result was subtracted from the instruction profiling result.
The following instructions were measured:
lock cmpxchg [rsp - 8], rdx (both with comparison match and mismatch)
lock xadd [rsp - 8], rdx
lock bts qword ptr [rsp - 8], 1
In all cases the time measured was about 310 cycles; the error was about +/- 8 cycles.
This is the value for repeated execution on the same (cached) memory. With an additional cache miss, the times are considerably higher. Also, this was done with only one of the 2 cores active, so the cache was owned exclusively and no cache synchronisation was required.
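For reference, a rough user-space approximation of this kind of measurement can be done with the rdtscp intrinsic. This is only a sketch of the idea, not the original test code: it is noisier than the bare-metal setup described above (interrupts, frequency scaling, no unrolling), and the instruction and iteration count are my own choices.

    #include <atomic>
    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>   // __rdtscp (GCC/Clang/MSVC, x86 only)

    int main() {
        std::atomic<int64_t> target{0};    // single, repeatedly cached line
        constexpr int kIters = 100000;
        unsigned aux;

        uint64_t start = __rdtscp(&aux);
        for (int i = 0; i < kIters; ++i) {
            // fetch_add compiles to "lock xadd [mem], reg" on x86-64
            target.fetch_add(1, std::memory_order_seq_cst);
        }
        uint64_t end = __rdtscp(&aux);

        std::printf("~%.1f cycles per locked add (incl. loop overhead)\n",
                    static_cast<double>(end - start) / kIters);
    }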
To evaluate the cost of a locked instruction on a cache miss, I added a wbinvd instruction before the locked instruction, and put the wbinvd plus an add [rsp - 8], rax into the comparison loop. In both cases the cost was about 80,000 cycles per instruction pair! In the case of lock bts, the time difference was about 180 cycles per instruction. Note that this is the reciprocal throughput, but since locked operations are serializing operations, there is probably no difference to latency.
Conclusion: a locked operation is heavy, but a cache miss can be much heavier.
Also: a locked operation does not cause cache misses. It can only cause cache synchronisation traffic when a cache line is not owned exclusively.
To boot the machine, I used an x64 version of FreeLdr from the ReactOS project. Here's the asm source code:
On bus-based SMP, the atomic prefix LOCK does assert (turn on) a bus wire signal, LOCK#. It prohibits other CPUs/devices on the bus from using the bus while it is asserted.
Ppro & P2 book, pages 244-246: http://books.google.com/books?id=3gDmyIYvFH4C&pg=PA245&dq=lock+instruction+pentium&lr=&ei=_E61S5ehLI78zQSzrqwI&cd=1#v=onepage&q=lock%20instruction%20pentium&f=false
The overhead of atomic operations is usually heavy, and it is determined by multiple factors, including the CPU implementation, the degree of concurrency (threads/cores), and the kind of atomic operation (read/write/FADD/CAS).
I found a simple benchmark tool that you can run on your own machine:
https://github.com/wichtounet/articles/blob/master/src/threads/benchmark/bench.cpp
I extended the code to compare non-atomic FADD (please ignore correctness) and atomic FADD, and tested it on an Intel i7-7700 CPU.
Benchmark with a single thread: atomic FADD is about 3 times slower.
Benchmark with 2 threads: atomic FADD is about 12.7 times slower.
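This comparison can be reproduced with a small sketch like the following (my own simplified version, not the extended bench.cpp; the thread counts, iteration count and integer counter are arbitrary choices):

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Run `body` on `threads` threads and return the elapsed wall time in ms.
    template <typename Fn>
    static double time_ms(int threads, Fn body) {
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (int t = 0; t < threads; ++t) pool.emplace_back(body);
        for (auto& th : pool) th.join();
        return std::chrono::duration<double, std::milli>(
                   std::chrono::steady_clock::now() - start).count();
    }

    int main() {
        constexpr long kIters = 10000000;
        for (int threads : {1, 2}) {
            volatile long plain = 0;          // non-atomic: racy, timing only
            std::atomic<long> shared{0};

            double t_plain = time_ms(threads, [&] {
                for (long i = 0; i < kIters; ++i) plain = plain + 1;
            });
            double t_atomic = time_ms(threads, [&] {
                for (long i = 0; i < kIters; ++i)
                    shared.fetch_add(1, std::memory_order_relaxed);
            });
            std::printf("%d thread(s): plain %.1f ms, atomic %.1f ms (%.1fx)\n",
                        threads, t_plain, t_atomic, t_atomic / t_plain);
        }
    }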
Note that when writing benchmark code like this, you should not involve memory barriers: barriers also slow down your code, but atomic operations and memory barriers are orthogonal concepts.
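One way to see that the two concepts are separate in C++ is that the same atomic read-modify-write can be issued with or without ordering constraints; a brief sketch (my own example, not from the benchmark):

    #include <atomic>

    std::atomic<long> counter{0};

    // Both calls are equally atomic; only the ordering guarantees differ.
    // On x86-64 both typically compile to the same "lock xadd", while on
    // weakly ordered CPUs (e.g. ARM) the seq_cst version adds barrier work.
    void bump_relaxed() { counter.fetch_add(1, std::memory_order_relaxed); }
    void bump_seq_cst() { counter.fetch_add(1, std::memory_order_seq_cst); }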