When are lock-free data structures less performant than mutual exclusion (mutexes)?

Published 2024-08-08 07:51:05

I read somewhere (can't find the page anymore) that lock-free data structures are more efficient "for certain workloads", which seems to imply that sometimes they're actually slower, or that the gain from them can be zero in some situations. Taking the ~100-cycle hit of a lock instruction to do an atomic op sounds plenty faster to me than going to sleep and waiting for the scheduler to wake the process back up, so it's not obvious to me under what circumstances a lock-free data structure would be less preferable than old-fashioned mutexes. If the lock is available 99% of the time and the process doesn't have to go to sleep, is a mutex then faster? Is there a good rule of thumb for knowing which way to go, assuming a suitable lock-free data structure is available?

6 Answers

各自安好 2024-08-15 07:51:05

A common approach to implementing a lock-free data structure is to have a mutable reference to an immutable object, and have anything that wants to change the structure grab the reference, produce a new version of the object with suitable changes applied, and then CompareExchange the reference to point to the new object. If the CompareExchange works, great. If not, ditch the new object, re-grab the reference, and start over.

This can work well if producing the new object is cheap and the level of contention is low enough that the CompareExchange will usually work. If there is considerable contention, and if producing the new object is slow, simultaneous attempted updates by N threads may take N^2 time to complete. As an extreme example, suppose 100 threads are running on a CPU, an update takes 100ms of CPU time (just over a time-slice), and all 100 threads attempt to update an object at once. During the first ten seconds, each thread will produce a new object based on the original one. One of the threads will successfully do the CompareExchange, while the others will all fail. Then during the next 9.9 seconds, 99 threads will generate new versions of the object, after which one will successfully post its update and 98 will fail. The net effect will be that the lock-free method will take 505 seconds' worth of CPU time to perform 100 updates, when a locking system could have done them in about 10 seconds.
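The copy-then-CAS pattern described above might be sketched in Java like this (the class and its methods are illustrative, not from the original answer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the copy-then-CAS pattern: a mutable reference to an immutable
// snapshot. Each update copies the snapshot, applies the change, and tries
// to publish the new version with a single CompareExchange (compareAndSet).
public class CasList {
    private final AtomicReference<List<Integer>> ref =
            new AtomicReference<>(List.of());

    public void add(int value) {
        while (true) {
            List<Integer> current = ref.get();          // grab the reference
            List<Integer> updated = new ArrayList<>(current);
            updated.add(value);                         // build the new version
            if (ref.compareAndSet(current, List.copyOf(updated))) {
                return;                                 // CAS worked: published
            }
            // CAS failed: another thread won the race.
            // Ditch the new object, re-grab the reference, start over.
        }
    }

    public List<Integer> snapshot() {
        return ref.get();
    }
}
```

Note that the cost of each failed iteration is the full copy in `new ArrayList<>(current)`, which is exactly why the N-threads-contending scenario above degrades to roughly N^2 work.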

碍人泪离人颜 2024-08-15 07:51:05

Lock-free data structures will, one way or another, use atomic semantics from your architecture to perform their core operations. When you do this, you are using the machine's entire internal exclusion mechanism to ensure correct ordering or fencing of data. A mutex or critical section also does this, but it only does it once, for a single flag. Where the mutex or critical section is slow is when the lock acquisition fails (there is contention). In that case, the OS also invokes the scheduler to suspend the thread until the exclusion object has been released.

So it seems logical that whenever your lock-free data structure uses multiple atomic operations per core method, while a single lock shielding a critical section could provide the same semantics, and there tends in practice to be very little contention for the data structure in question, then it does in fact make more sense to use an OS-provided locking mechanism than to try to build your own.
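A small sketch of that trade-off (the class and field names are illustrative): maintaining a (sum, count) pair lock-free costs two atomic read-modify-write operations per update, and the pair is never observed atomically, whereas a single lock protects both fields with one acquisition.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: two atomic ops per update vs. one lock per update.
// The atomic version pays for two RMW instructions and readers can observe
// sum updated before count; the locked version uses one flag for both fields.
public class RunningAverage {
    private final AtomicLong atomicSum = new AtomicLong();
    private final AtomicLong atomicCount = new AtomicLong();

    private long sum;
    private long count;

    // "Lock-free": two separate atomic operations per call.
    public void recordAtomic(long value) {
        atomicSum.addAndGet(value);
        atomicCount.incrementAndGet();
    }

    public long atomicAverage() {
        long c = atomicCount.get();
        return c == 0 ? 0 : atomicSum.get() / c;
    }

    // Locked: one acquisition shields both updates, as the answer describes.
    public synchronized void recordLocked(long value) {
        sum += value;
        count++;
    }

    public synchronized long lockedAverage() {
        return count == 0 ? 0 : sum / count;
    }
}
```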

千鲤 2024-08-15 07:51:05

I don't know about making it slower, but it certainly makes it harder to get right. In the many cases where the two approaches are virtually identical in performance (or when it simply doesn't matter if it takes 500 pico-seconds rather than 100 pico-seconds), then pick the simplest approach - generally lock.

There are very few cases when that extra bit of performance is key; and if it is, I suspect you'd do well to use the pre-rolled pattern implementations from established libraries. Getting lock-free code working properly (and proving that it works properly in all conditions) is often very hard.

Note also that some environments offer a level of locking above the OS-provided mutex; mutex behaviour, but without some of the overheads (for example, Monitor in .NET).

药祭#氼 2024-08-15 07:51:05

I would like to add one point to this part of the answer:
"Where the mutex or critical section is slow is when the lock acquisition fails (there is contention). In that case, the OS also invokes the scheduler to suspend the thread until the exclusion object has been released."

It seems different operating systems can take different approaches when lock acquisition fails. I use HP-UX, for example, which has a more sophisticated approach to locking mutexes. Here is its description:

... On the other hand, changing context is an expensive process. If the wait is going to be a short one, we'd rather not do the context switch. To balance out these requirements, when we try to get a semaphore and find it locked, the first thing we do is a short spin wait. The routine psema_spin_1() is called to spin for up to 50,000 clock cycles trying to get the lock. If we fail to get the lock after 50,000 cycles, we then call psema_switch_1() to give up the processor and let another process take over.
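The spin-then-block strategy quoted above can be sketched in Java (the spin bound, the parking scheme, and all names here are assumptions for illustration, not the real `psema_spin_1()`/`psema_switch_1()` routines):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Sketch of a two-phase lock: spin briefly in the hope the holder releases
// soon (avoiding a context switch), then give up the processor and wait.
public class SpinThenParkLock {
    private static final int SPIN_LIMIT = 50_000;
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        // Phase 1: short spin wait, analogous to psema_spin_1().
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (locked.compareAndSet(false, true)) {
                return;
            }
        }
        // Phase 2: stop burning CPU and yield, analogous to psema_switch_1().
        while (!locked.compareAndSet(false, true)) {
            LockSupport.parkNanos(1_000);
        }
    }

    public void unlock() {
        locked.set(false);
    }

    public boolean isLocked() {
        return locked.get();
    }
}
```

A production adaptive mutex would also back off while spinning and queue parked waiters rather than polling; this sketch only shows the two-phase idea.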

七色彩虹 2024-08-15 07:51:05

Keep in mind that a mutex may well be implemented as a lock-free data structure, in the sense that it uses one or a few atomic objects to represent its state. It's a false dichotomy.

Better is to consider whether you need to allow multiple threads to wait for access to some set of operations or to block until signaled. Each requires a queue of waiting threads. The former queues threads waiting for access to the synchronized area, while the latter queues threads waiting for a signal. The Java classes AbstractQueuedSynchronizer and AbstractQueuedLongSynchronizer provide such a queue—in particular, a CLH Queue—upon which one can build mutexes, conditions, and other queue-based primitives.
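A minimal mutex built on that queue, close to the example in the `AbstractQueuedSynchronizer` javadoc (this non-reentrant sketch is illustrative; state 0 means unlocked, 1 means locked, and AQS supplies the CLH queue of waiting threads):

```java
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

// A non-reentrant mutex: the subclass defines only how state is acquired
// and released; AQS handles queuing, parking, and waking waiting threads.
public class AqsMutex {
    private static final class Sync extends AbstractQueuedSynchronizer {
        @Override
        protected boolean tryAcquire(int ignored) {
            return compareAndSetState(0, 1); // one CAS on the lock flag
        }

        @Override
        protected boolean tryRelease(int ignored) {
            setState(0);
            return true; // wake the next queued thread, if any
        }
    }

    private final Sync sync = new Sync();

    public void lock()       { sync.acquire(1); }
    public void unlock()     { sync.release(1); }
    public boolean tryLock() { return sync.tryAcquire(1); }
}
```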

If your requirements favor instead only one thread taking on an exclusive set of work while other threads remain free to carry on with other work, as opposed to waiting until they too can do that same work themselves, then using lock-free techniques is possible. Whether doing so will grant faster run times falls to benchmarking, being subject to how often and how many threads will contend over these synchronization controls, and whether there's other work for threads to perform independently.

孤者何惧 2024-08-15 07:51:05

Efficiency depends on the metric. Lock-, or wait-free algorithms are important in systems where preemption can introduce deadlock or affect scheduling deadlines. In those cases, processing is less important than correctness.

The OP frames the choice as lock-free structures versus mutexes. Some algorithms require neither to access a shared data structure. In these cases, both producer and consumer can access the same data structure concurrently without regard for the other. An example is a shared queue that permits a single reader and a single writer to simultaneously act on a shared instance. This meets the common need of a device driver writing data that a consumer process can access on demand.
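A single-reader/single-writer queue of that kind can be sketched as a ring buffer in which the producer only advances `head` and the consumer only advances `tail`, so neither needs a CAS loop, only atomic publication of the indices (the class is illustrative; indices grow without bound in this sketch, so it is not overflow-safe):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a single-producer/single-consumer ring buffer: each index has
// exactly one writer, so plain atomic stores suffice for publication.
public class SpscQueue {
    private final int[] buffer;
    private final AtomicInteger head = new AtomicInteger(); // producer only
    private final AtomicInteger tail = new AtomicInteger(); // consumer only

    public SpscQueue(int capacity) {
        buffer = new int[capacity];
    }

    // Called only by the single producer; returns false when full.
    public boolean offer(int value) {
        int h = head.get();
        if (h - tail.get() == buffer.length) {
            return false;
        }
        buffer[h % buffer.length] = value;
        head.set(h + 1); // publish only after the slot is written
        return true;
    }

    // Called only by the single consumer; returns null when empty.
    public Integer poll() {
        int t = tail.get();
        if (t == head.get()) {
            return null;
        }
        int value = buffer[t % buffer.length];
        tail.set(t + 1); // frees the slot for the producer
        return value;
    }
}
```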

More complex relationships between processes can be permitted, with varying levels of hardware support (see Herlihy (1991) for an analysis). He concludes that "wait-free synchronization represents a qualitative break with the traditional locking-based techniques for implementing concurrent objects."

This means there is still a trade-off, but it is not simply a choice between mutexes and spinlocks.

A rule of thumb remains to focus on correctness rather than performance. Performance can usually be achieved by throwing money at the problem, while meeting requirements is usually more difficult.
