Average latency of the atomic cmpxchg instruction on Intel CPUs
I am looking for references on the average latency of the lock cmpxchg instruction on various Intel processors. I have not been able to locate a good reference on the topic, and any pointer would help greatly.
Thanks.
Comments (5)
The best x86 instruction latency reference is probably the one in Agner Fog's optimization manuals, which is based on empirical measurements on various Intel/AMD/VIA chips and is frequently updated for the latest CPUs on the market.
Unfortunately, I don't see the CMPXCHG instruction listed in the instruction latency tables, but page 4 does state:
There are few, if any, good references on this because there is so much variation. It depends on basically everything including bus speed, memory speed, processor speed, processor count, surrounding instructions, memory fencing and quite possibly the angle between the moon and Mt Everest...
If you have a very specific application, as in, known (fixed) hardware, operating environment, a real-time operating system and exclusive control, then maybe it will matter. In this case, benchmark. If you don't have this level of control over where your software is running, any measurements are effectively meaningless.
As discussed in these answers, locks are implemented using CAS, so if you can get away with a single CAS instead of a lock (which needs at least two atomic operations, one to acquire and one to release), it will be faster, though the difference may or may not be noticeable.
The best references you will find are the Intel Software Developer's Manuals, though since there is so much variation they won't give you an actual number. They will, however, describe how to get the best performance possible. Possibly a processor datasheet (such as those here for the i7 Extreme Edition, under "Technical Documents") will give you actual numbers (or at least a range).
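Since the advice above is to benchmark on the hardware you actually control, here is a minimal sketch of what such a measurement could look like. It assumes an x86-64 target and a compiler such as GCC or Clang, where compare_exchange_strong on a 64-bit std::atomic is typically emitted as lock cmpxchg. The names CasTiming and time_uncontended_cas are made up for this illustration. It only covers the uncontended, single-threaded case where the cache line stays in the local L1, so it gives a lower bound and says nothing about contended behaviour.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical result type for this sketch.
struct CasTiming {
    double ns_per_cas;          // average wall-clock nanoseconds per CAS
    std::uint64_t final_value;  // should equal the iteration count
};

// Times a chain of dependent, always-successful CAS operations on one
// atomic variable that no other thread touches.
CasTiming time_uncontended_cas(std::uint64_t iterations) {
    assert(iterations > 0);
    std::atomic<std::uint64_t> value{0};
    std::uint64_t expected = 0;

    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        // Each CAS moves value from `expected` to `expected + 1`.
        // On failure, `expected` is refreshed with the current value.
        if (value.compare_exchange_strong(expected, expected + 1)) {
            ++expected;
        }
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return {elapsed.count() / static_cast<double>(iterations), value.load()};
}
```

Calling time_uncontended_cas with a few million iterations and multiplying ns_per_cas by the core clock in GHz gives a rough cycles-per-operation figure. The loop overhead is included, and the result only holds for the machine and conditions it was measured under.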
You can use AIDA64 software to check instruction latencies (but you cannot choose which instructions to check; it has a hard-coded list of instructions). People are publishing the results at http://instlatx64.atw.hu/
Of the lock instructions, AIDA64 measures the lock add instructions and xchg [mem] (which is always locking, even without an explicit lock prefix).
Here is some info. I will also give you, for comparison, latencies of the following instructions: xchg reg1, reg2, which is not locking; and add to registers and to memory.
As you see, the locking instructions are just 5 times slower on Haswell-DT and just ~2 times slower on Kaby Lake-S than non-locking memory stores.
Intel Core i5-4430, 3000 MHz (30 x 100) Haswell-DT
Intel Core i7-7700K, 4700 MHz (47 x 100) Kaby Lake-S
I've been looking into exponential backoff for a few months now.
The latency of CAS is utterly dominated by whether the instruction can operate from cache or has to operate from memory. Typically, a given memory address is being CAS'd by a number of threads (say, an entry pointer to a queue). If the most recent successful CAS was performed by a logical processor which shares a cache with the current CAS executor (L1, L2 or L3, although of course the higher levels are slower), then the instruction will operate on cache and will be fast: a few cycles. If the most recent successful CAS was performed by a logical processor which does not share a cache with the current executor, then the write of the most recent CASer will have invalidated the cache line for the current executor and a memory read is required. This will take hundreds of cycles.
The CAS operation itself is very fast, a few cycles. The problem is memory.
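To make the exponential backoff idea concrete, here is a minimal sketch of a CAS retry loop that waits longer after each failed attempt, so that competing threads spend less time stealing the cache line from each other. The function name increment_with_backoff and the max_delay parameter are made up for this illustration. std::this_thread::yield() is used as a portable stand-in for the x86 pause instruction that a tuned implementation would more likely use.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical helper: atomically adds 1 to `counter` using a CAS retry
// loop with exponential backoff. Returns the number of failed attempts.
unsigned increment_with_backoff(std::atomic<std::uint64_t>& counter,
                                unsigned max_delay = 1024) {
    assert(max_delay >= 1);
    unsigned failures = 0;
    unsigned delay = 1;
    std::uint64_t seen = counter.load(std::memory_order_relaxed);

    while (!counter.compare_exchange_weak(seen, seen + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        ++failures;
        // Back off: wait `delay` steps, then double the wait up to a cap.
        for (unsigned i = 0; i < delay; ++i) {
            std::this_thread::yield();  // stand-in for x86 `pause`
        }
        if (delay < max_delay) {
            delay *= 2;
        }
        // Re-read after waiting so the next attempt uses a fresh value.
        seen = counter.load(std::memory_order_relaxed);
    }
    return failures;
}
```

The function is meant to be called concurrently from several threads on the same counter. With a single caller the loop body normally never runs, which matches the observation above that an uncontended CAS on a cached line is cheap. Good values for the initial delay and the cap depend on the machine and have to be found by measurement.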
I've been trying to benchmark CAS and DCAS, expressing their cost as a number of NOPs.
I have some results, but I don't trust them yet; verification is ongoing.
Currently, on a Core i5, I see CAS/DCAS costing about 3/5 NOPs. On a Xeon, I see 20/22.
These results may be completely incorrect. You have been warned.