How does the x86 PAUSE instruction work in a spinlock, and can it be used in other scenarios?

Posted 2024-10-12 14:18:17

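The PAUSE instruction is commonly used in the loop that tests a spinlock while some other thread owns the lock, to ease the tight loop. It is said to be equivalent to a few NOP instructions. Can someone tell me exactly how it helps spinlock optimization? It seems to me that even NOP instructions waste CPU time. Do they reduce CPU usage?

Another question: can I use the PAUSE instruction for other, similar purposes? For example, I have a busy thread that keeps scanning some places (e.g. a queue) to retrieve new nodes; however, sometimes the queue is empty and the thread is just wasting CPU time. Putting the thread to sleep and having another thread wake it up is an option, but the thread is critical, so I don't want to put it to sleep.

Can the PAUSE instruction serve my purpose of reducing CPU usage? Currently the thread uses 100% of a physical core.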

北斗星光 2024-10-19 14:18:17

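PAUSE tells the CPU that this is a spinlock wait loop, so memory and cache accesses can be optimized. See also "pause instruction in x86" for more details on avoiding the memory-order mis-speculation when leaving the spin loop.

PAUSE may actually stop the CPU for some time to save power. Older CPUs decode it as REP NOP, so you don't have to check whether it is supported. Those older CPUs will simply do nothing (NOP) as fast as possible.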
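As a rough illustration of where the hint sits in a wait loop, here is a minimal Rust sketch. It assumes `std::hint::spin_loop()` as the portable way to issue the hint (on x86 targets it is expected to lower to PAUSE); the function names `spin_until_ready` and `demo_spin_wait` are hypothetical and not from the original answer.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Spin until `ready` becomes true, issuing the spin-wait hint on every
/// iteration. Returns the number of iterations spent waiting.
fn spin_until_ready(ready: &AtomicBool) -> u64 {
    let mut spins = 0u64;
    while !ready.load(Ordering::Acquire) {
        // Assumption: on x86 / x86-64 this is emitted as PAUSE.
        hint::spin_loop();
        spins += 1;
    }
    spins
}

/// Hypothetical demo: a helper thread sets the flag while the caller spins.
fn demo_spin_wait() -> bool {
    let ready = Arc::new(AtomicBool::new(false));
    let setter = {
        let ready = Arc::clone(&ready);
        thread::spawn(move || ready.store(true, Ordering::Release))
    };
    spin_until_ready(&ready);
    setter.join().unwrap();
    ready.load(Ordering::Acquire)
}
```

The loop still occupies the core while it waits; the hint only makes the waiting cheaper and the exit from the loop faster.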

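See also https://software.intel.com/en-us/articles/benefitting-power-and-performance-sleep-loops

Update: I don't think it's a good idea to use PAUSE in the queue check unless you are going to make your queue behave like a spinlock (and there is no obvious way to do that).

Spinning for a very long time is still bad, even with PAUSE.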

呆头 2024-10-19 14:18:17

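The processor suffers a severe performance penalty when exiting the
loop because it detects a possible memory order violation. The PAUSE
instruction provides a hint to the processor that the code sequence is
a spin-wait loop. The processor uses this hint to avoid the memory
order violation in most situations, which greatly improves processor
performance. For this reason, it is recommended that a PAUSE
instruction be placed in all spin-wait loops.
An additional function of the PAUSE instruction is to reduce the power
consumed by Intel processors.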

[Source: Intel manual]

茶花眉 2024-10-19 14:18:17

Pause-based spin-wait loops

As I understand from your question, the waits in your case are known in advance to be very long. In this case, spin-wait loops are not recommended at all. But if you are using a spin loop that keeps checking a value in memory (e.g. a byte-sized synchronization variable), use PAUSE. See Section 11.4.2 "Synchronization for Short Periods" of the Intel 64 and IA-32 Architectures Optimization Reference Manual.

You wrote that you have a "thread which keeps scanning some places (e.g. a queue) to retrieve new nodes".

In such a case (i.e. a long wait), Intel recommends using the synchronization API functions of your operating system. For example, you can create an event that is signaled when a new node appears in the queue, and simply wait for this event using WaitForSingleObject(Handle, INFINITE). The queue triggers this event whenever a new node appears.
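The same idea, blocking until a producer signals instead of scanning an empty queue, can be sketched portably in Rust. This is only an illustration under assumptions: it uses `Mutex` plus `Condvar` from the standard library in place of a Windows event, and `BlockingQueue` and `demo_blocking_queue` are hypothetical names.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical blocking queue: a consumer sleeps until a producer signals.
struct BlockingQueue<T> {
    items: Mutex<VecDeque<T>>,
    not_empty: Condvar,
}

impl<T> BlockingQueue<T> {
    fn new() -> Self {
        BlockingQueue {
            items: Mutex::new(VecDeque::new()),
            not_empty: Condvar::new(),
        }
    }

    /// Add a node and wake one waiting consumer (the "event" is signaled).
    fn push(&self, item: T) {
        self.items.lock().unwrap().push_back(item);
        self.not_empty.notify_one();
    }

    /// Block without burning CPU until a node is available.
    fn pop(&self) -> T {
        let mut items = self.items.lock().unwrap();
        loop {
            if let Some(item) = items.pop_front() {
                return item;
            }
            // Releases the mutex while sleeping, reacquires it on wake-up.
            items = self.not_empty.wait(items).unwrap();
        }
    }
}

/// Hypothetical demo: one producer pushes three nodes, the caller pops them.
fn demo_blocking_queue() -> Vec<u32> {
    let queue = Arc::new(BlockingQueue::<u32>::new());
    let producer = {
        let queue = Arc::clone(&queue);
        thread::spawn(move || {
            for n in 1..=3 {
                queue.push(n);
            }
        })
    };
    let received: Vec<u32> = (0..3).map(|_| queue.pop()).collect();
    producer.join().unwrap();
    received
}
```

While the queue is empty the consumer uses no CPU time at all, at the price of a wake-up latency that a spinning thread would not pay.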

According to the Intel Optimization Reference Manual, Section 2.3.4 "Pause Latency in Skylake Client Microarchitecture",

The PAUSE instruction is typically used with software threads
executing on two logical processors located in the same processor
core, waiting for a lock to be released. Such short wait loops tend to
last between tens and a few hundreds of cycles, so performance-wise it
is better to wait while occupying the CPU than yielding to the OS.

I take "tens and a few hundreds of cycles" in the quote above to mean roughly 20 to 500 CPU cycles.

500 CPU cycles on a 4500 MHz Intel Core i7 7700K processor (released in January 2017, based on the Kaby-Lake-S microarchitecture) take about 0.00000011 seconds (roughly 111 ns), i.e. about 1/9,000,000 of a second: the CPU can run this 500-cycle loop about 9 million times per second.
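The conversion between cycles and wall-clock time is plain division by the clock frequency. A small Rust sketch of that arithmetic, with hypothetical helper names `cycles_to_ns` and `loops_per_second`:

```rust
/// Convert a cycle count to nanoseconds at a clock frequency given in MHz.
fn cycles_to_ns(cycles: u64, mhz: u64) -> f64 {
    // One cycle at f MHz lasts 1000 / f nanoseconds.
    cycles as f64 * 1000.0 / mhz as f64
}

/// How many back-to-back loops of `cycles` cycles fit into one second.
fn loops_per_second(cycles: u64, mhz: u64) -> f64 {
    mhz as f64 * 1e6 / cycles as f64
}
```

For 500 cycles at 4500 MHz, `cycles_to_ns(500, 4500)` is about 111.1 ns and `loops_per_second(500, 4500)` is 9 million.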

This 500-cycle limit recommended by Intel is theoretical, and everything depends on the particular use case, i.e. on the logic of the code that needs to be synchronized by spin-wait loops. Some scenarios, like the FastMM4-AVX memory manager for Delphi, work better with a value of 5000, according to the benchmarks. Even so, these benchmarks do not always reflect real-world scenarios, and real program use cases should be measured.

As you see, this PAUSE-based spin-wait loop is for really short periods of time.

On the other hand, each call to an API function like Sleep() experiences the expensive cost of a context switch, which can be 10000+ cycles; it also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles.

If there are more threads than available processor cores (multiplied by the number of hardware threads per core, if hyperthreading is present), and a thread gets switched out in the middle of a critical section, waiting for that critical section from another thread may take really long, at least 10000+ cycles, so the PAUSE-based spin-wait loop will be futile.

In addition to the relevant chapters of the Intel Optimization Reference Manual, please see the following articles for more information:

When the wait loop is expected to last for thousands of cycles or more, it is
preferable to yield to the operating system by calling one of the OS synchronization API functions, such as WaitForSingleObject or SwitchToThread on Windows OS.
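One common way to combine the two regimes is to spin with the hint for a bounded number of iterations and then start yielding. The sketch below is illustrative only: `SPIN_LIMIT` is an assumed value that would need to be measured for a real workload, `wait_spin_then_yield` is a hypothetical name, and `std::thread::yield_now()` stands in for an OS call such as SwitchToThread.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Assumed spin budget; a real value should come from measurement.
const SPIN_LIMIT: u32 = 100;

/// Wait for `ready`: spin with the PAUSE hint first, then yield to the OS.
/// Returns true if the wait finished within the spin budget.
fn wait_spin_then_yield(ready: &AtomicBool) -> bool {
    // Short phase: stay on the CPU, as recommended for brief waits.
    for _ in 0..SPIN_LIMIT {
        if ready.load(Ordering::Acquire) {
            return true;
        }
        hint::spin_loop();
    }
    // Long phase: give the time slice to other threads between checks.
    while !ready.load(Ordering::Acquire) {
        thread::yield_now();
    }
    false
}
```

For waits that are known to be long from the start, a blocking primitive is still the better fit than yielding in a loop.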

In conclusion: in your scenario, the PAUSE-based spin-wait loop won't be the best choice, since your waiting time is long while the spin-wait loop is intended for very short waits.

The PAUSE instruction takes about 140 CPU cycles on processors based on the Skylake microarchitecture or later. For example, that is just 35.10 ns on an Intel Core i7-6700K CPU (4 GHz), released in August 2015, or 49.47 ns on an Intel Core i7-1165G7 mobile CPU, released in September 2020. On earlier processors (prior to Skylake), such as those based on the Haswell microarchitecture, it takes about 9 cycles: 2.81 ns on an Intel Core i5-4430 (3 GHz), released in June 2013. So, for long loops, it is better to relinquish control to other threads using the OS synchronization API functions than to occupy the CPU with a PAUSE loop, regardless of the microarchitecture.
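Figures like these can be roughly checked on your own machine. The following Rust sketch times a run of spin-wait hints and reports the average per iteration. It is an assumption-laden estimate: `avg_spin_hint_ns` is a hypothetical helper, the result includes loop overhead, it depends on the current clock frequency, and on non-x86 targets `std::hint::spin_loop()` maps to a different instruction.

```rust
use std::hint;
use std::time::Instant;

/// Average wall-clock cost, in nanoseconds, of one spin-wait hint.
/// Includes the loop overhead, so treat the result as an upper bound.
fn avg_spin_hint_ns(iterations: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iterations {
        // Assumption: emitted as PAUSE on x86 / x86-64.
        hint::spin_loop();
    }
    start.elapsed().as_nanos() as f64 / iterations as f64
}
```

Calling it with a large count, for example `avg_spin_hint_ns(1_000_000)`, and multiplying by the clock frequency gives an approximate cycle count to compare with the numbers above.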

Test, Test-and-Set

Please note that spin-wait loops also have to be implemented properly. Intel recommends the so-called "test, test-and-set" technique (see Section 11.4.3 "Optimization with Spin-Locks" of the Intel 64 and IA-32 Architectures Optimization Reference Manual) to determine the availability of the synchronization variable. According to this technique, the first "test" is done via a normal (non-locking) memory load to prevent excessive bus locking during the spin-wait loop; if the variable is available upon the non-locking memory load of the first step ("test"), proceed to the second step ("test-and-set"), which is done via the bus-locking atomic xchg instruction.

But be aware that this two-step approach of using "test" before "test-and-set" can increase the cost of the uncontended case compared with a single-step "test-and-set". The initial read-only access might only get the cache line in the Shared state, so an atomic operation like test-and-set (xchg) or compare-and-swap (cmpxchg) still needs a "Read For Ownership" (RFO) operation to get exclusive ownership of the cache line. This operation is issued by a processor trying to write into a cache line that is in the Shared state.
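A minimal Rust sketch of the test, test-and-set technique described above. It is an illustration, not Intel's reference code: `TtasLock` and `demo_ttas` are hypothetical names, `AtomicBool::swap` is assumed to compile to xchg on x86, and `std::hint::spin_loop()` supplies the PAUSE hint.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical test, test-and-set spinlock.
struct TtasLock {
    locked: AtomicBool,
}

impl TtasLock {
    const fn new() -> Self {
        TtasLock { locked: AtomicBool::new(false) }
    }

    fn lock(&self) {
        loop {
            // "Test": ordinary load, spins on the local cached copy.
            while self.locked.load(Ordering::Relaxed) {
                hint::spin_loop();
            }
            // "Test-and-set": atomic exchange, only tried when the lock looked free.
            if !self.locked.swap(true, Ordering::Acquire) {
                return;
            }
        }
    }

    fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}

/// Hypothetical demo: several threads bump a counter under the lock.
fn demo_ttas(threads: u64, increments: u64) -> u64 {
    let lock = Arc::new(TtasLock::new());
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let (lock, counter) = (Arc::clone(&lock), Arc::clone(&counter));
            thread::spawn(move || {
                for _ in 0..increments {
                    lock.lock();
                    // Separate load and store: only correct because the lock is held.
                    let v = counter.load(Ordering::Relaxed);
                    counter.store(v + 1, Ordering::Relaxed);
                    lock.unlock();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}
```

If the lock did not provide mutual exclusion, the split load and store in the demo would lose updates and the final count would fall short of `threads * increments`.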
