Average latency of the atomic cmpxchg instruction on Intel CPUs

Posted 2024-10-02 18:24:11

I am looking for some reference on average latencies for the lock cmpxchg instruction on various Intel processors. I have not been able to locate any good reference on the topic, and any pointer would help greatly.

Thanks.

5 Answers

海的爱人是光 2024-10-09 18:24:11

The best x86 instruction latency reference is probably the one in Agner Fog's optimization manuals, based on actual empirical measurements on various Intel/AMD/VIA chips and frequently updated for the latest CPUs on the market.

Unfortunately, I don't see the CMPXCHG instruction listed in the instruction latency tables, but page 4 does state:

Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.
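
As a quick illustration (my own sketch, not from the manual): on x86-64, std::atomic's compare_exchange_strong typically compiles to a single lock cmpxchg instruction, so a few lines of C++ are enough to get at the instruction being asked about:

// Minimal sketch: on x86-64, compare_exchange_strong on std::atomic<int>
// typically compiles to `lock cmpxchg`. Build with -O2 and inspect the
// assembly (e.g. with objdump -d) to confirm on your toolchain.
#include <atomic>
#include <cstdio>

int main() {
    std::atomic<int> v{0};
    int expected = 0;
    // Uncontended CAS: the cache line stays local to this core, so this is
    // the best case, not the cross-core case described in the quote above.
    bool ok = v.compare_exchange_strong(expected, 1);
    std::printf("swapped: %d, value: %d\n", (int)ok, v.load());
}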

暮倦 2024-10-09 18:24:11

There are few, if any, good references on this because there is so much variation. It depends on basically everything, including bus speed, memory speed, processor speed, processor count, surrounding instructions, memory fencing, and quite possibly the angle between the moon and Mt Everest...

If you have a very specific application, as in known (fixed) hardware, operating environment, a real-time operating system and exclusive control, then maybe it will matter. In that case, benchmark; a rough sketch follows at the end of this answer. If you don't have this level of control over where your software runs, any measurements are effectively meaningless.

As discussed in these answers, locks are implemented using CAS, so if you can get away with a CAS instead of a lock (which needs at least two operations), it will be faster (noticeably? only maybe).

The best references you will find are the Intel Software Developer's Manuals, though since there is so much variation they won't give you an actual number. They will, however, describe how to get the best possible performance. A processor datasheet (such as those here for the i7 Extreme Edition, under "Technical Documents") may give you actual numbers (or at least a range).
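
If you do benchmark, a sketch along these lines (the loop count and methodology are mine, purely illustrative) measures the uncontended case; contended numbers require multiple threads and will be far worse:

// Rough single-threaded benchmark: times back-to-back uncontended
// `lock cmpxchg` operations via std::atomic. Treat the result as a
// lower bound for this machine only.
#include <atomic>
#include <chrono>
#include <cstdio>

int main() {
    constexpr long iters = 100'000'000;
    std::atomic<long> v{0};
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) {
        long expected = i;
        // Always succeeds: each iteration swaps i -> i + 1.
        v.compare_exchange_strong(expected, i + 1);
    }
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("avg %.2f ns per uncontended CAS\n", ns / iters);
}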

〆凄凉。 2024-10-09 18:24:11

You can use the AIDA64 software to check instruction latencies (though you cannot choose which instructions to check; it has a hard-coded list). People publish the results at http://instlatx64.atw.hu/

Among the locked instructions, AIDA64 tests lock add and xchg [mem] (which always locks, even without an explicit lock prefix).

Here is some data. For comparison, I also include latencies of the following instructions:

  • xchg reg1, reg2, which is not locking;
  • add to registers and memory.

As you can see, the locked instructions are only about 5 times slower on Haswell-DT and only about 2 times slower on Kaby Lake-S than non-locked memory stores; a sketch for reproducing this kind of comparison follows the two tables.

Intel Core i5-4430, 3000 MHz (30 x 100) Haswell-DT

LOCK ADD [m8], r8         L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
LOCK ADD [m16], r16       L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
LOCK ADD [m32], r32       L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
LOCK ADD [m32 + 8], r32   L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
LOCK ADD [m64], r64       L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
LOCK ADD [m64 + 16], r64  L: 5.96ns= 17.8c  T: 7.21ns= 21.58c

XCHG r8, [m8]             L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
XCHG r16, [m16]           L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
XCHG r32, [m32]           L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
XCHG r64, [m64]           L: 5.96ns= 17.8c  T: 7.21ns= 21.58c

ADD r32, 0x04000          L: 0.22ns=  0.9c  T: 0.09ns=  0.36c
ADD r32, 0x08000          L: 0.22ns=  0.9c  T: 0.09ns=  0.36c
ADD r32, 0x10000          L: 0.22ns=  0.9c  T: 0.09ns=  0.36c
ADD r32, 0x20000          L: 0.22ns=  0.9c  T: 0.08ns=  0.34c
ADD r8, r8                L: 0.22ns=  0.9c  T: 0.05ns=  0.23c
ADD r16, r16              L: 0.22ns=  0.9c  T: 0.07ns=  0.29c
ADD r32, r32              L: 0.22ns=  0.9c  T: 0.05ns=  0.23c
ADD r64, r64              L: 0.22ns=  0.9c  T: 0.07ns=  0.29c
ADD r8, [m8]              L: 1.33ns=  5.6c  T: 0.11ns=  0.47c
ADD r16, [m16]            L: 1.33ns=  5.6c  T: 0.11ns=  0.47c
ADD r32, [m32]            L: 1.33ns=  5.6c  T: 0.11ns=  0.47c
ADD r64, [m64]            L: 1.33ns=  5.6c  T: 0.11ns=  0.47c
ADD [m8], r8              L: 1.19ns=  5.0c  T: 0.32ns=  1.33c
ADD [m16], r16            L: 1.19ns=  5.0c  T: 0.21ns=  0.88c
ADD [m32], r32            L: 1.19ns=  5.0c  T: 0.22ns=  0.92c
ADD [m32 + 8], r32        L: 1.19ns=  5.0c  T: 0.22ns=  0.92c
ADD [m64], r64            L: 1.19ns=  5.0c  T: 0.20ns=  0.85c
ADD [m64 + 16], r64       L: 1.19ns=  5.0c  T: 0.18ns=  0.73c

Intel Core i7-7700K, 4700 MHz (47 x 100) Kaby Lake-S

LOCK ADD [m8], r8         L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
LOCK ADD [m16], r16       L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
LOCK ADD [m32], r32       L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
LOCK ADD [m32 + 8], r32   L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
LOCK ADD [m64], r64       L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
LOCK ADD [m64 + 16], r64  L: 4.01ns= 16.8c  T: 5.12ns= 21.50c

XCHG r8, [m8]             L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
XCHG r16, [m16]           L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
XCHG r32, [m32]           L: 4.01ns= 16.8c  T: 5.20ns= 21.83c
XCHG r64, [m64]           L: 4.01ns= 16.8c  T: 5.12ns= 21.50c

ADD r32, 0x04000          L: 0.33ns=  1.0c  T: 0.12ns=  0.36c
ADD r32, 0x08000          L: 0.31ns=  0.9c  T: 0.12ns=  0.37c
ADD r32, 0x10000          L: 0.31ns=  0.9c  T: 0.12ns=  0.36c
ADD r32, 0x20000          L: 0.31ns=  0.9c  T: 0.12ns=  0.36c
ADD r8, r8                L: 0.31ns=  0.9c  T: 0.11ns=  0.34c
ADD r16, r16              L: 0.31ns=  0.9c  T: 0.11ns=  0.32c
ADD r32, r32              L: 0.31ns=  0.9c  T: 0.11ns=  0.34c
ADD r64, r64              L: 0.31ns=  0.9c  T: 0.10ns=  0.31c
ADD r8, [m8]              L: 1.87ns=  5.6c  T: 0.16ns=  0.47c
ADD r16, [m16]            L: 1.87ns=  5.6c  T: 0.16ns=  0.47c
ADD r32, [m32]            L: 1.87ns=  5.6c  T: 0.16ns=  0.47c
ADD r64, [m64]            L: 1.87ns=  5.6c  T: 0.16ns=  0.47c
ADD [m8], r8              L: 1.89ns=  5.7c  T: 0.33ns=  1.00c
ADD [m16], r16            L: 1.87ns=  5.6c  T: 0.26ns=  0.78c
ADD [m32], r32            L: 1.87ns=  5.6c  T: 0.28ns=  0.84c
ADD [m32 + 8], r32        L: 1.89ns=  5.7c  T: 0.26ns=  0.78c
ADD [m64], r64            L: 1.89ns=  5.7c  T: 0.33ns=  1.00c
ADD [m64 + 16], r64       L: 1.89ns=  5.7c  T: 0.24ns=  0.73c
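
For what it's worth, a comparison like the uncontended numbers above can be approximated from C++ with a sketch along these lines (the loop count and the volatile trick are my own assumptions, not AIDA64's methodology). Because each iteration depends on the previous one, both loops measure a latency-bound chain, which roughly corresponds to the L column above.

#include <atomic>
#include <chrono>
#include <cstdio>

template <typename F>
static double ns_per_op(F f, long iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

int main() {
    constexpr long iters = 50'000'000;
    std::atomic<long> locked{0};
    volatile long plain = 0;
    // fetch_add compiles to a locked RMW (`lock add`/`lock xadd`) on x86.
    double a = ns_per_op([&] { locked.fetch_add(1, std::memory_order_relaxed); }, iters);
    // The volatile increment is a plain, unlocked memory read-modify-write.
    double b = ns_per_op([&] { plain = plain + 1; }, iters);
    std::printf("locked add: %.2f ns  plain add: %.2f ns  ratio: %.1fx\n",
                a, b, a / b);
}
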
通知家属抬走 2024-10-09 18:24:11

I've been looking into exponential backoff for a few months now.

The latency of CAS is utterly dominated by whether the instruction can operate from cache or has to go to memory. Typically, a given memory address is being CAS'd by a number of threads (say, an entry pointer for a queue). If the most recent successful CAS was performed by a logical processor which shares a cache with the current CAS executor (L1, L2 or L3, although of course the higher levels are slower), then the instruction will operate on cache and will be fast - a few cycles. If the most recent successful CAS was performed by a logical core which does not share a cache with the current executor, then the most recent CASer's write will have invalidated the current executor's cache line, and a memory read is required - this will take hundreds of cycles.

The CAS operation itself is very fast - a few cycles - the problem is memory.
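
A sketch of the effect being described (the thread and iteration counts are mine, purely illustrative): two threads CASing the same cache line force it to ping-pong between cores, while giving each thread its own line keeps every CAS local.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct alignas(64) Padded {      // one counter per cache line
    std::atomic<long> v{0};
};

static double ns_per_cas(Padded* a, Padded* b, long iters) {
    auto worker = [iters](Padded* target) {
        for (long i = 0; i < iters; ++i) {
            long expected = target->v.load(std::memory_order_relaxed);
            // Retry loop: under contention the core must re-acquire the
            // line in exclusive state before each attempt can succeed.
            while (!target->v.compare_exchange_weak(expected, expected + 1)) {}
        }
    };
    auto begin = std::chrono::steady_clock::now();
    std::thread t1(worker, a), t2(worker, b);
    t1.join(); t2.join();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(end - begin).count() / iters;
}

int main() {
    constexpr long iters = 10'000'000;
    Padded x, y;
    double same     = ns_per_cas(&x, &x, iters);  // both threads, one line
    double separate = ns_per_cas(&x, &y, iters);  // one line per thread
    std::printf("same line: %.2f ns/CAS, separate lines: %.2f ns/CAS\n",
                same, separate);
}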

我是有多爱你 2024-10-09 18:24:11

I've been trying to benchmark CAS and DCAS in terms of NOP.

I have some results, but I don't trust them yet - verification is ongoing.

Currently, on a Core i5 I see 3/5 NOPs for CAS/DCAS. On a Xeon, I see 20/22.

These results may be completely incorrect - you were warned.
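
For reference, if "DCAS" here means a double-width CAS (x86 cmpxchg16b; that is my assumption about what was benchmarked, as the answer does not say), it is reachable from C++ via a 16-byte atomic struct. Build with -mcx16 on GCC/Clang and check is_lock_free(), since otherwise the library may fall back to a lock:

#include <atomic>
#include <cstdint>
#include <cstdio>

struct alignas(16) Pair {       // 16 bytes, no padding
    std::uint64_t lo, hi;
};

int main() {
    std::atomic<Pair> p{Pair{1, 2}};
    std::printf("lock-free: %d\n", (int)p.is_lock_free());
    Pair expected{1, 2};
    // Double-width CAS: both 64-bit words are compared and swapped
    // atomically (compiles to cmpxchg16b when lock-free).
    bool ok = p.compare_exchange_strong(expected, Pair{3, 4});
    Pair now = p.load();
    std::printf("swapped: %d, value: {%llu, %llu}\n", (int)ok,
                (unsigned long long)now.lo, (unsigned long long)now.hi);
}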
