Is a critical section always faster?
I was debugging a multi-threaded application and came across the internal structure of CRITICAL_SECTION. I found the data member LockSemaphore of CRITICAL_SECTION an interesting one.

It looks like LockSemaphore is an auto-reset event (not a semaphore, as the name suggests), and the operating system creates this event silently the first time a thread waits on a Critical Section that is locked by some other thread.

Now, I am wondering: is a Critical Section always faster? The event is a kernel object, and each Critical Section object is associated with an event object, so how can a Critical Section be faster compared to other kernel objects like a Mutex? Also, how does the internal event object actually affect the performance of the Critical Section?

Here is the structure of CRITICAL_SECTION:
struct RTL_CRITICAL_SECTION
{
    PRTL_CRITICAL_SECTION_DEBUG DebugInfo;
    LONG LockCount;
    LONG RecursionCount;
    HANDLE OwningThread;
    HANDLE LockSemaphore;
    ULONG_PTR SpinCount;
};
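For context, here is a minimal sketch of how this structure is normally used; the fields are opaque to callers and are only manipulated through the Win32 API:

#include <windows.h>

CRITICAL_SECTION g_cs;   // lives in plain process memory, no HANDLE needed
long g_counter = 0;

DWORD WINAPI Worker(LPVOID)
{
    for (int i = 0; i < 100000; ++i)
    {
        EnterCriticalSection(&g_cs);   // uncontended case stays in user mode
        ++g_counter;                   // protected region
        LeaveCriticalSection(&g_cs);
    }
    return 0;
}

int main()
{
    InitializeCriticalSection(&g_cs);
    HANDLE h[2];
    h[0] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    h[1] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    WaitForMultipleObjects(2, h, TRUE, INFINITE);
    CloseHandle(h[0]);
    CloseHandle(h[1]);
    DeleteCriticalSection(&g_cs);   // also releases the lazily created event
    return 0;
}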
7 Answers
When they say that a critical section is "fast", they mean "it's cheap to acquire one when it isn't already locked by another thread".

[Note that if it is already locked by another thread, then it doesn't matter nearly so much how fast it is.]

The reason why it's fast is that, before going into the kernel, it uses the equivalent of InterlockedIncrement on one of those LONG fields (probably the LockCount field), and if that succeeds it considers the lock acquired without ever having gone into the kernel. The InterlockedIncrement API is, I think, implemented in user mode as a LOCK INC opcode ... in other words, you can acquire an uncontested critical section without doing any ring transition into the kernel at all.
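To make that fast-path idea concrete, here is a toy lock built on the same principle. This is a simplified illustration only, not the real NT algorithm (which handles recursion, spinning, and many corner cases): an interlocked operation in user mode acquires the uncontended lock, and an auto-reset event, playing the role of LockSemaphore, is created lazily only when contention forces a kernel wait.

#include <windows.h>

struct ToyLock
{
    volatile LONG lockCount;  // 0 = free, >0 = held (plus any waiters)
    HANDLE event;             // created lazily, like LockSemaphore
};

void ToyLockInit(ToyLock* l) { l->lockCount = 0; l->event = NULL; }

static HANDLE ToyLockEvent(ToyLock* l)
{
    // Create the auto-reset event on first contention; a loser of the
    // creation race closes its extra handle.
    if (l->event == NULL)
    {
        HANDLE e = CreateEvent(NULL, FALSE, FALSE, NULL);
        if (InterlockedCompareExchangePointer((PVOID volatile*)&l->event,
                                              e, NULL) != NULL)
            CloseHandle(e);
    }
    return l->event;
}

void ToyLockEnter(ToyLock* l)
{
    // Fast path: one LOCK-prefixed increment, no ring transition.
    if (InterlockedIncrement(&l->lockCount) == 1)
        return;
    // Slow path: somebody holds the lock; wait on the kernel event.
    WaitForSingleObject(ToyLockEvent(l), INFINITE);
}

void ToyLockLeave(ToyLock* l)
{
    // If other threads are queued, wake exactly one of them.
    if (InterlockedDecrement(&l->lockCount) > 0)
        SetEvent(ToyLockEvent(l));
}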
In performance work, few things fall into the "always" category :) If you implement something yourself that is similar to an OS critical section using other primitives, the odds are that it will be slower in most cases.

The best way to answer your question is with performance measurements. How OS objects perform is very dependent on the scenario. For example, critical sections are generally considered 'fast' if contention is low. They are also considered fast if the lock time is less than the spin-count time.

The most important thing to determine is whether contention on a critical section is the first-order limiting factor in your application. If not, then simply use a critical section normally and work on your application's primary bottleneck (or bottlenecks).

If critical section performance is critical, then you can consider the following.

In summary - tuning scenarios that have lock contention can be challenging (but interesting!) work. Focus on measuring your application's performance and understanding where your hot paths are. The xperf tools in the Windows Performance Toolkit are your friend here :) We just released version 4.5 in the Microsoft Windows SDK for Windows 7 and .NET Framework 3.5 SP1 (ISO is here, web installer here). You can find the forum for the xperf tools here. V4.5 fully supports Win7, Vista, Windows Server 2008 - all versions.
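If you want a quick starting point for such measurements, a rough single-threaded (uncontended) comparison can look like the sketch below. The iteration count is arbitrary and the printed numbers are only illustrative; real results vary a lot by machine and Windows version:

#include <windows.h>
#include <stdio.h>

int main()
{
    const int N = 1000000;
    CRITICAL_SECTION cs;
    InitializeCriticalSection(&cs);
    HANDLE mtx = CreateMutex(NULL, FALSE, NULL);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    for (int i = 0; i < N; ++i)
    {
        EnterCriticalSection(&cs);   // user-mode fast path when uncontended
        LeaveCriticalSection(&cs);
    }
    QueryPerformanceCounter(&t1);
    printf("critical section: %.0f ns per lock/unlock pair\n",
           (t1.QuadPart - t0.QuadPart) * 1e9 / freq.QuadPart / N);

    QueryPerformanceCounter(&t0);
    for (int i = 0; i < N; ++i)
    {
        WaitForSingleObject(mtx, INFINITE);  // kernel transition every time
        ReleaseMutex(mtx);
    }
    QueryPerformanceCounter(&t1);
    printf("mutex:            %.0f ns per lock/unlock pair\n",
           (t1.QuadPart - t0.QuadPart) * 1e9 / freq.QuadPart / N);

    CloseHandle(mtx);
    DeleteCriticalSection(&cs);
    return 0;
}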
Critical sections are faster, but InterlockedIncrement/InterlockedDecrement is faster still. See this usage sample of the LightweightLock implementation (full copy).
The CriticalSection will spin a short while (a few ms) and keep checking whether the lock is free. After the spin count 'times out', it falls back to the kernel event. So in the case where the holder of the lock gets out quickly, you never have to make the expensive transition into kernel code.

EDIT: Went and found some comments in my code: apparently the MS Heap Manager uses a spin count of 4000 (integer increments, not ms).
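Both the spinning behavior and that figure are under your control through documented Win32 calls. A small sketch (the 4000 here just echoes the heap-manager value quoted above, not a recommendation; tune it by measurement):

#include <windows.h>

CRITICAL_SECTION cs;

void InitWithSpin()
{
    // Set the spin count at creation time...
    InitializeCriticalSectionAndSpinCount(&cs, 4000);

    // ...or adjust it later; the previous spin count is returned.
    DWORD oldSpin = SetCriticalSectionSpinCount(&cs, 4000);
    (void)oldSpin;   // note: on single-processor machines the spin count is ignored
}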
Here's a way to look at it:

If there's no contention, then the spin lock is really fast compared to going to kernel mode for a Mutex.

When there is contention, a CriticalSection is slightly more expensive than using a Mutex directly (because of the extra work to detect the spinlock state).

So it boils down to a weighted average, where the weights depend on the specifics of your calling pattern. That being said, if you have little contention, then a CriticalSection is a big win. If, on the other hand, you consistently have lots of contention, then you'd be paying a very small penalty over using a Mutex directly. But in that case, what you'd gain by switching to a Mutex is small, so you'd probably be better off trying to reduce the contention.
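As a purely illustrative back-of-the-envelope version of that weighted average: if the uncontended user-mode path costs about 25 ns, a kernel wait about 1000 ns, and 99% of acquisitions are uncontended, then the critical section averages roughly 0.99 × 25 + 0.01 × 1000 ≈ 35 ns per acquisition, while the mutex pays on the order of the full kernel cost every single time. (The numbers are made up; only the shape of the trade-off matters.)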
A critical section is faster than a mutex because a critical section is not a kernel object; it is part of the memory of the current process. A mutex actually resides in the kernel, and creating a mutex object requires a kernel transition, whereas creating a critical section does not. But even though a critical section is fast, there will still be a kernel transition while using a critical section when threads go into a wait state, because thread scheduling happens on the kernel side.
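That difference in where the two objects live shows up directly in the creation calls; a minimal sketch:

#include <windows.h>

int main()
{
    HANDLE mtx = CreateMutex(NULL, FALSE, NULL); // kernel object: returns a
                                                 // HANDLE, can be named and
                                                 // shared across processes
    CRITICAL_SECTION cs;
    InitializeCriticalSection(&cs);              // plain struct in process
                                                 // memory: no HANDLE, usable
                                                 // only within this process

    CloseHandle(mtx);             // kernel handle must be closed
    DeleteCriticalSection(&cs);   // releases the lazily created event, if any
    return 0;
}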
From my experience and experiments, CRITICAL_SECTION is extremely slow compared to a pthreads implementation. Extremely means around 10 times slower at switching threads when the number of locks/unlocks is large, comparing the same code against a pthread implementation.

I thus never use Critical Sections again; pthreads are also available on MS Windows, and the performance nightmares are finally over.
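For completeness, here is what the pthreads equivalent looks like. The assumption is that a pthreads implementation for Windows is installed (for example pthreads-win32 or MinGW's winpthreads), since Windows does not ship one:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
long counter = 0;

void* worker(void*)
{
    for (int i = 0; i < 100000; ++i)
    {
        pthread_mutex_lock(&m);     // same lock/unlock pattern as above
        ++counter;
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main()
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld\n", counter);
    return 0;
}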