Do lock-free algorithms really perform better than lock-full algorithms?
Raymond Chen has been doing a huge series on lock-free algorithms. Beyond the simple cases of the InterlockedXxx functions, it seems like the prevailing pattern with all of these is that they implement their own locks. Sure, there are no processor locks, but the concept of looping over and over on each CPU to ensure consistency is very much like a spinlock. And being spinlocks, they are going to be less efficient than the general locks that come with the operating system, because they do not yield control of their quantum while waiting for other threads. Therefore, whenever someone comes to me and says "but my algorithm is lock-free", my general response is "so?"
I'm curious -- are there benchmarks available which show lock free algorithms to have an edge over their lock-full counterparts?
11 Answers
None of the answers here really seem to get to the heart of the difference between a "lock-free" CAS loop and a mutex or spin-lock.
The important difference is that lock-free algorithms are guaranteed to make progress on their own - without the assistance of other threads. With a lock or spin lock, any poor thread that can't acquire a lock is entirely at the mercy of the thread that owns the lock. The poor thread that can't acquire the lock can do nothing except wait (either via a busy wait or an OS-assisted sleep).
With lock-free algorithms that loop on a CAS, the system as a whole is guaranteed to make progress regardless of what any one contending thread is doing: a thread's CAS can only fail because some other thread's CAS succeeded. Each thread is, essentially, in control of its own fate. Yes, it may still have to loop many times, but the number of times it loops is bounded by the number of contending threads; it cannot loop infinitely, for the most part. (Strictly speaking, lock-freedom guarantees system-wide progress; per-thread progress is the stronger, wait-free guarantee. In practice, live-lock can still occur, e.g. with an LL/SC loop that keeps failing due to false sharing - but the thread itself can take measures to deal with this; it is not at the mercy of another thread holding a lock.)
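The CAS retry loop being described can be sketched in a few lines of C++. The function name `atomic_double` is purely illustrative, not anything from Raymond Chen's series:

```cpp
#include <atomic>

// Illustrative sketch of a lock-free update via a CAS retry loop.
// A failed compare_exchange means some other thread's CAS succeeded,
// so the system as a whole always makes progress.
int atomic_double(std::atomic<int>& value) {
    int expected = value.load();
    // On failure, compare_exchange_weak reloads `expected` with the
    // current value, and we simply retry with fresh data.
    while (!value.compare_exchange_weak(expected, expected * 2)) {
        // spin: retries are driven by other threads' successful updates
    }
    return expected * 2;  // the value we successfully installed
}
```

No thread here ever blocks waiting on another; a waylaid thread merely delays itself, which is the robustness property discussed above.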
As for performance, it depends. I've seen flagrant examples of lock-free algorithms being totally out-performed by their locking counterparts, even under high thread contention. On an x86-64 machine running Debian 7, I compared the performance of the C++ Boost.Lockfree queue (based on the Michael/Scott algorithm) against a plain old std::queue surrounded by a std::mutex. Under high thread contention, the lock-free version was almost twice as slow.

So why is that? Well, the performance of lock-free algorithms ultimately comes down to the implementation details. How does the algorithm avoid ABA? How does it accomplish safe memory reclamation? There are so many variants... tagged pointers, epoch-based reclamation, RCU/quiescent state, hazard pointers, general process-wide garbage collection, etc. All these strategies have performance implications, and some also place restrictions on how your application in general can be designed. In general, reference-counting approaches (or tagged-pointer approaches) tend to perform poorly, in my experience. But the alternatives can be much more complex to implement, and require a lot more memory-reclamation infrastructure based around thread-local storage or generalized garbage collection.
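For reference, the lock-based baseline in that comparison takes only a few lines. A minimal sketch (LockedQueue is my name for it; assumes C++17 for std::optional):

```cpp
#include <mutex>
#include <optional>
#include <queue>

// Minimal sketch of the lock-based baseline: a plain std::queue
// guarded by a std::mutex. Simple - and often surprisingly competitive.
template <typename T>
class LockedQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(value));
    }
    // Returns std::nullopt when the queue is empty.
    std::optional<T> pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty()) return std::nullopt;
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }
private:
    std::mutex mutex_;
    std::queue<T> queue_;
};
```

Note that this needs no ABA avoidance and no memory-reclamation scheme at all - the mutex makes those problems disappear, which is part of why it can win.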
In general, lock free algorithms are less efficient per thread - you're doing more work, as you mentioned, in order to implement a lock free algorithm than a simple lock.
However, they do tend to dramatically improve the overall throughput of the algorithm as a whole in the face of contention. Thread-switching latency and context switches, though individually fast, dramatically slow down the throughput of your application when multiplied across many threads. Lock-free algorithms effectively implement their own "locks," but they do so in a manner that prevents or reduces the number of context switches, which is why they tend to outperform their locking counterparts.

That being said, most of this depends on the algorithm (and implementation) in question. For example, I've got some routines that I managed to switch to .NET 4's new concurrent collections instead of the previous locking mechanisms, and measured improvements of nearly 30% in my total algorithm speed. That said, there are many benchmarks you can find that show reduced performance using some of these same collections when compared to a basic lock. As with all performance optimizations - you really don't know until you measure.
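In that spirit, here is a rough sketch of what "measure it" can look like. The `time_ms` helper and the two `bench_` functions are hypothetical names of mine, and a real benchmark needs warm-up, core pinning, and far more care than this:

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical micro-benchmark skeleton: run `threads` threads, each doing
// `iters` iterations of `work`, and return the elapsed wall time in ms.
template <typename Work>
double time_ms(int threads, int iters, Work work) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] { for (int i = 0; i < iters; ++i) work(); });
    for (auto& th : pool) th.join();
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}

// Example contenders: a mutex-guarded counter vs. a lock-free atomic one.
// Which wins depends on thread count, hardware, and what `work` really does.
double bench_mutex(long& counter, std::mutex& m, int threads, int iters) {
    return time_ms(threads, iters, [&] {
        std::lock_guard<std::mutex> g(m);
        ++counter;
    });
}

double bench_atomic(std::atomic<long>& counter, int threads, int iters) {
    return time_ms(threads, iters, [&] {
        counter.fetch_add(1, std::memory_order_relaxed);
    });
}
```

The point is not these particular workloads - a bare counter increment flatters the atomic - but that the comparison is cheap to set up for your own data structure and thread counts.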
Lock-free isn't necessarily any faster, but it can eliminate the possibility of deadlock or livelock, so you can guarantee that your program will always make progress toward finishing. With locks, it's difficult to make any such guarantee -- it's all too easy to miss some possible execution sequence that results in a deadlock.
Past that, it all depends. At least in my experience, differences in speed tend to depend more on the skill level deployed in the implementation than whether it uses locks or not.
Under Windows on x64, a straightforward (no combining array in front of the freelist) lock-free freelist is about an order of magnitude faster than a mutex based freelist.
On my laptop (Core i5), for a single thread, lock-free I get about 31 million freelist operations per second, vs for mutex about 2.3 million operations per second.
For two threads (on separate physical cores), with lock-free I get about 12.4 million freelist operations per second per thread. With a mutex, I get about 80 THOUSAND operations per second.
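A sketch of the kind of freelist being measured, as a Treiber-style stack of free nodes. This shows only the CAS structure; it deliberately ignores the ABA problem, which a production version must solve (e.g. with a version-tagged head pointer), and which is where much of the real cost lives:

```cpp
#include <atomic>

// Sketch of a lock-free freelist: a singly linked stack of free nodes
// whose head is swapped in and out with compare-exchange.
struct Node {
    Node* next = nullptr;
};

class Freelist {
public:
    void push(Node* node) {
        Node* old_head = head_.load();
        do {
            node->next = old_head;  // link in front of the current head
        } while (!head_.compare_exchange_weak(old_head, node));
    }
    Node* pop() {
        Node* old_head = head_.load();
        // WARNING: susceptible to ABA if nodes are recycled concurrently.
        while (old_head &&
               !head_.compare_exchange_weak(old_head, old_head->next)) {
            // old_head was refreshed by the failed CAS; retry
        }
        return old_head;  // nullptr if the list was empty
    }
private:
    std::atomic<Node*> head_{nullptr};
};
```

Each operation is a load plus a single CAS on the hot path, which is why the gap versus a kernel-arbitrated mutex can reach an order of magnitude under contention.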
The primary advantage of genuinely lock-free algorithms is that they are robust even if a task gets waylaid (note that lock-free is a tougher condition than "not using locks"(*)). While there are performance advantages to avoiding unnecessary locking, the best-performing data structures are often those which can operate lock-free in many cases, but which can use locks to minimize thrashing.

(*) I've seen some attempts at "lock-free" multi-producer queues where a producer that got waylaid at the wrong time would prevent consumers from seeing any new items until it completed its work; such data structures shouldn't really be called "lock-free". One producer that gets blocked won't block other producers from making progress, but may arbitrarily block consumers.
Lock-free algorithms can absolutely be faster than their blocking counterparts. But of course the inverse is true as well. Assuming the implementation performs better than its locking counterpart, the only limiting factor is contention.

Take the two Java classes, ConcurrentLinkedQueue and LinkedBlockingQueue. Under moderate real-world contention, the CLQ outperforms the LBQ by a good amount. Under heavy contention, the use of suspending threads allows the LBQ to perform better.

I disagree with user237815. The synchronized keyword doesn't carry as much overhead as it once did, but relative to a single CAS in a lock-free algorithm it still has a good amount of overhead associated with it.
Recently at [JavaOne Russia][1], an Oracle employee (who specializes in Java performance and benchmarks) showed some measurements of operations per second under parallel access to a simple int counter, using CAS (lock-free, in fact a high-level spinlock) versus classic locks (java.util.concurrent.locks.ReentrantLock).

According to this, spin-locks have better performance only while a small number of threads are trying to access the monitor.
In Java, at least, locking by itself can be very quick. The synchronized keyword doesn't add a lot of overhead. You can benchmark it yourself just by calling a synchronized method in a loop.
Locking only gets slow when there is contention, and the process being locked isn't instantaneous.
Lock-free also has the advantage that it does not sleep. There are places in kernels where you are not permitted to sleep - the Windows kernel has a bunch of them - and that painfully restricts your ability to use data structures.
The following pic (src) may give you a better understanding of Charles Salvia's answer:
Yes, lock-freedom ensures progress, but unless you manually abort threads (which is possible on some platforms), or allocate inside a critical section and get an out-of-memory exception, or do anything similarly careless, you don't need that guarantee.

A properly implemented spinlock almost always beats lockless approaches, or at worst performs equally, because with lock-free code you typically do more work the first time through or after unsuccessful attempts.

If you spin for too long, overwhelm the CPUs with compare-exchange instructions, and/or don't back off after some period by yielding your timeslice to other threads (which gives an out-scheduled thread the opportunity to wake up and release the lock), then lock-free code can perform better. Other than that, I don't think it's possible. I'm not interested in, nor have I worked much with, complex data types where a spinlock doesn't fit, but I still sense that properly designed lock-based algorithms will almost always be better. I may be wrong, though.
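The kind of spinlock this answer argues for - spin briefly, then yield the timeslice - might be sketched like this (BackoffSpinlock and its tuning constant are illustrative, not from any particular library):

```cpp
#include <atomic>
#include <thread>

// Sketch of a spinlock with yield-based backoff: spin on an atomic flag
// for a bounded number of attempts, then give up the rest of the timeslice
// so an out-scheduled lock holder gets a chance to run and release the lock.
class BackoffSpinlock {
public:
    void lock() {
        int spins = 0;
        while (flag_.test_and_set(std::memory_order_acquire)) {
            if (++spins >= kSpinLimit) {
                std::this_thread::yield();  // back off instead of hammering the bus
                spins = 0;
            }
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    static constexpr int kSpinLimit = 64;  // illustrative tuning knob
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```

Without the yield branch, this degenerates into exactly the CPU-burning spin the answer warns about; with it, short critical sections stay cheap while an out-scheduled holder doesn't stall everyone for a full quantum.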