Is a critical section always faster?

Posted 2024-07-20 12:46:04


I was debugging a multi-threaded application and came across the internal structure of CRITICAL_SECTION. I found its data member LockSemaphore an interesting one.

It looks like LockSemaphore is an auto-reset event (not a semaphore, as the name suggests), and the operating system creates this event silently the first time a thread waits on a critical section that is locked by some other thread.

Now, I am wondering: is a critical section always faster? An event is a kernel object, and each critical section object is associated with an event object, so how can a critical section be faster than other kernel objects such as a mutex? Also, how does the internal event object actually affect the performance of a critical section?

Here is the structure of the CRITICAL_SECTION:

struct RTL_CRITICAL_SECTION
{
    PRTL_CRITICAL_SECTION_DEBUG DebugInfo;
    LONG LockCount;        // lock status / waiter count
    LONG RecursionCount;   // re-entry count for the owning thread
    HANDLE OwningThread;   // id of the thread currently holding the lock
    HANDLE LockSemaphore;  // auto-reset event, created lazily on first contention
    ULONG_PTR SpinCount;   // spins before falling back to the kernel event
};


Comments (7)

伤痕我心 2024-07-27 12:46:04


When they say that a critical section is "fast", they mean "it's cheap to acquire one when it isn't already locked by another thread".

[Note that if it is already locked by another thread, then it doesn't matter nearly so much how fast it is.]

The reason it's fast is that, before going into the kernel, it uses the equivalent of InterlockedIncrement on one of those LONG fields (perhaps the LockCount field), and if that succeeds it considers the lock acquired without ever having gone into the kernel.

The InterlockedIncrement API is, I think, implemented in user mode as a "LOCK INC" opcode ... in other words, you can acquire an uncontested critical section without any ring transition into the kernel at all.

青芜 2024-07-27 12:46:04


In performance work, few things fall into the "always" category :) If you implement something similar to an OS critical section yourself using other primitives, odds are it will be slower in most cases.

The best way to answer your question is with performance measurements. How OS objects perform is very dependent on the scenario. For example, critical sections are generally considered 'fast' if contention is low. They are also considered fast if the lock time is less than the spin count time.

The most important thing to determine is whether contention on a critical section is the first-order limiting factor in your application. If not, then simply use a critical section normally and work on your application's primary bottleneck (or bottlenecks).

If critical section performance is critical, then you can consider the following.

  1. Carefully set the spin count for your 'hot' critical sections. If performance is paramount, the work here is worth it. Remember, while the spin lock does avoid the user-mode-to-kernel transition, it consumes CPU time at a furious rate - while spinning, nothing else gets to use that CPU time. If a lock is held for long enough, the spinning thread will actually block, freeing up that CPU to do other work.
  2. If you have a reader/writer pattern, consider using the Slim Reader/Writer (SRW) locks. The downside is that they are only available on Vista and Windows Server 2008 and later products.
  3. You may be able to use condition variables with your critical section to minimize polling and contention, waking threads only when needed. Again, these are supported on Vista and Windows Server 2008 and later products.
  4. Consider using Interlocked Singly Linked Lists (SLIST) - these are efficient and 'lock free'. Even better, they are supported on XP and Windows Server 2003 and later products.
  5. Examine your code - you may be able to break up a 'hot' lock by refactoring some code and using an interlocked operation, or an SLIST, for synchronization and communication.

In summary - tuning scenarios that have lock contention can be challenging (but interesting!) work. Focus on measuring your application's performance and understanding where your hot paths are. The xperf tools in the Windows Performance Toolkit are your friend here :) We just released version 4.5 in the Microsoft Windows SDK for Windows 7 and .NET Framework 3.5 SP1 (ISO is here, web installer here). You can find the forum for the xperf tools here. V4.5 fully supports Win7, Vista, Windows Server 2008 - all versions.

流年里的时光 2024-07-27 12:46:04


CriticalSections are faster, but InterlockedIncrement/InterlockedDecrement are even more so. See this sample implementation and its usage, LightweightLock (full copy).

万劫不复 2024-07-27 12:46:04


A CriticalSection will spin for a short while (a few ms) and keep checking whether the lock is free. After the spin count 'times out', it falls back to the kernel event. So in the case where the holder of the lock gets out quickly, you never have to make the expensive transition to kernel code.

EDIT: Went and found some comments in my code: apparently the MS Heap Manager uses a spin count of 4000 (integer increments, not ms).

欲拥i 2024-07-27 12:46:04


Here's a way to look at it:

If there's no contention, then the spin lock is really fast compared to going to kernel mode for a Mutex.

When there is contention, a CriticalSection is slightly more expensive than using a Mutex directly (because of the extra work to detect the spinlock state).

So it boils down to a weighted average, where the weights depend on the specifics of your calling pattern. That being said, if you have little contention, then a CriticalSection is a big win. If, on the other hand, you consistently have lots of contention, then you'd be paying a very small penalty over using a Mutex directly. But in that case, what you'd gain by switching to a Mutex is small, so you'd probably be better off trying to reduce the contention.

夏见 2024-07-27 12:46:04


A critical section is faster than a mutex because a critical section is not a kernel object; it is part of the memory of the current process. A mutex actually resides in the kernel, and creating a mutex object requires a kernel switch, whereas creating a critical section does not. Even though a critical section is fast, there will still be a kernel switch when threads go into the wait state, because thread scheduling happens on the kernel side.

遮云壑 2024-07-27 12:46:04


From my experience and experiments, CRITICAL_SECTION is extremely slow compared to a pthreads implementation.

Extremely means around 10 times slower for switching threads when the number of locks/unlocks is large, comparing the same code with a pthread implementation.

I thus never use critical sections any more; pthreads are also available on MS Windows, and the performance nightmares are finally over.
