Critical sections on POSIX?

Posted 2024-08-15 13:52:49

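The Windows API provides critical sections in which a waiting thread spins a limited number of times before context switching, but only on multiprocessor systems. These are set up with InitializeCriticalSectionAndSpinCount. (See http://msdn.microsoft.com/en-us/library/ms682530.aspx.) This works well when a critical section is usually held only for a very short time, so contention should not immediately trigger a context switch. Two related questions:

For a high-level, cross-platform threading library or an implementation of a synchronized block, is a small amount of spinning before triggering a context switch a good default?
What, if anything, is the equivalent of InitializeCriticalSectionAndSpinCount on other operating systems, especially POSIX?

Edit: Of course, no spin count will be optimal for all cases. I only care whether a nonzero spin count would be a better default than not spinning at all.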

離殇 2024-08-22 13:52:49

My opinion is that the optimal "spin-count" for best application performance is too hardware-dependent for it to be an important part of a cross-platform API, and you should probably just use mutexes (in POSIX, pthread_mutex_init / destroy / lock / trylock) or spin-locks (pthread_spin_init / destroy / lock / trylock). Rationale follows.

What's the point of the spin count? Basically, if the lock owner is running simultaneously with the thread attempting to acquire the lock, then the lock owner might release the lock quickly enough that the EnterCriticalSection caller could avoid giving up CPU control in acquiring the lock, improving that thread's performance, and avoiding context switch overhead. Two things:

1: obviously this relies on the lock owner running in parallel to the thread attempting to acquire the lock. This is impossible on a single execution core, which is almost certainly why Microsoft treats the count as 0 in such environments. Even with multiple cores, it's quite possible that the lock owner is not running when another thread attempts to acquire the lock, and in such cases the optimal spin count (for that attempt) is still 0.

2: with simultaneous execution, the optimal spin count is still hardware dependent. Different processors will take different amounts of time to perform similar operations. They have different instruction sets (the ARM I work with most doesn't have an integer divide instruction), different cache sizes, the OS will have different pages in memory... Decrementing the spin count may take a different amount of time on a load-store architecture than on an architecture in which arithmetic instructions can access memory directly. Even on the same processor, the same task will take different amounts of time, depending on (at least) the contents and organization of the memory cache.
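Point 1 can be sketched in C roughly as follows. This is only an illustration, not how Windows implements it: effective_spin_count is a hypothetical helper name, and _SC_NPROCESSORS_ONLN is a widely available sysconf extension (glibc, the BSDs, macOS) rather than something POSIX requires.

```c
#include <unistd.h>

/* Hypothetical helper: decide how many times a lock should spin.
 * Spinning only helps if the lock owner can run at the same time as
 * the waiter, so with a single online CPU the count is forced to 0. */
static unsigned effective_spin_count(unsigned requested)
{
    /* Assumption: _SC_NPROCESSORS_ONLN is supported on this platform.
     * sysconf returns -1 if it is not, which is treated like one CPU. */
    long online = sysconf(_SC_NPROCESSORS_ONLN);

    if (online <= 1)
        return 0;        /* one core (or unknown): never spin */
    return requested;    /* several cores: keep the caller's value */
}
```

Note that this only covers the static part of point 1. It cannot tell whether the lock owner happens to be running at the moment of a particular acquisition attempt.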

If the optimal spin count with simultaneous execution is infinite, then the pthread_spin_* functions should do what you're after. If it is not, then use the pthread_mutex_* functions.
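For anyone who still wants to experiment with a finite spin count on POSIX, one possible approximation is to call pthread_mutex_trylock a bounded number of times and then fall back to pthread_mutex_lock. The sketch below is an assumption about how that could look, not an existing API: spin_mutex_t and the spin_mutex_* functions are hypothetical names, and any spin count passed in would have to be tuned per platform, for the reasons given above.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical wrapper: a plain POSIX mutex plus a spin budget. */
typedef struct {
    pthread_mutex_t mutex;
    unsigned spin_count;   /* trylock attempts before blocking */
} spin_mutex_t;

static int spin_mutex_init(spin_mutex_t *m, unsigned spin_count)
{
    m->spin_count = spin_count;
    return pthread_mutex_init(&m->mutex, NULL);
}

static int spin_mutex_lock(spin_mutex_t *m)
{
    /* Spin phase: poll the mutex without giving up the CPU. */
    for (unsigned i = 0; i < m->spin_count; i++) {
        int rc = pthread_mutex_trylock(&m->mutex);
        if (rc != EBUSY)
            return rc;     /* 0 means acquired, anything else is an error */
    }
    /* Spin budget exhausted: block until the mutex is released. */
    return pthread_mutex_lock(&m->mutex);
}

static int spin_mutex_unlock(spin_mutex_t *m)
{
    return pthread_mutex_unlock(&m->mutex);
}

static int spin_mutex_destroy(spin_mutex_t *m)
{
    return pthread_mutex_destroy(&m->mutex);
}
```

With a spin count of 0 this behaves like a plain pthread_mutex_lock, so the wrapper adds nothing in that case. glibc also has a non-portable PTHREAD_MUTEX_ADAPTIVE_NP mutex type that spins briefly before blocking, but that is outside POSIX.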
