Is my spin lock implementation correct and optimal?
I'm using a spin lock to protect a very small critical section. Contention happens very rarely so a spin lock is more appropriate than a regular mutex.
My current code is as follows, and assumes x86 and GCC:
volatile int exclusion = 0;

void lock() {
    while (__sync_lock_test_and_set(&exclusion, 1)) {
        // Do nothing. This GCC builtin instruction
        // ensures memory barrier.
    }
}

void unlock() {
    __sync_synchronize(); // Memory barrier.
    exclusion = 0;
}
So I'm wondering:
- Is this code correct? Does it correctly ensure mutual exclusion?
- Does it work on all x86 operating systems?
- Does it work on x86_64 too? On all operating systems?
- Is it optimal?
- I've seen spin lock implementations using compare-and-swap but I'm not sure which is better.
- According to the GCC atomic builtins documentation (http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html) there's also __sync_lock_release. I'm not an expert on memory barriers so I'm not sure whether it's okay for me to use this instead of __sync_synchronize.
- I'm optimizing for the case in which there's no contention.
I do not care at all about contention. There may be 1, maybe 2 other threads trying to lock the spin lock once every few days.
Looks fine to me. Btw, here is the textbook implementation that is more efficient even in the contended case.
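Presumably the textbook version meant here is test-and-test-and-set: under contention it spins on a plain read instead of repeating the atomic exchange. A sketch, reusing the question's exclusion variable (my reconstruction, since the answer's code isn't shown):

volatile int exclusion = 0;

void lock() {
    while (__sync_lock_test_and_set(&exclusion, 1)) {
        // Spin on a plain read: while the lock is held this stays in
        // the local cache instead of generating locked bus traffic.
        while (exclusion)
            ;
    }
}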
So I'm wondering:
In the context mentioned, I would say yes.
That's a loaded question. By reinventing the wheel you are also reinventing a lot of problems that have been solved by other implementations:
- I'd expect a waste loop on failure where you aren't trying to access the lock word.
- The full barrier in the unlock is more than you need; the unlock only needs release semantics (that's why you'd use __sync_lock_release, so that you'd get st1.rel on Itanium instead of mf, or a lwsync on PowerPC, ...). If you really only care about x86 or x86_64, the types of barriers used here don't matter as much (but if you were to make the jump to Intel's Itanium for an HP-IPF port then you wouldn't want this).
- You don't have the pause() instruction that you'd normally put inside your waste loop (see the sketch after this list).
- When there is contention you want to block on something: a semop, or even a dumb sleep in desperation. If you really need the performance that this buys you then the futex suggestion is probably a good one. If you need the performance this buys you badly enough to maintain this code, you have a lot of research to do.
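A minimal sketch of the first three points combined, assuming x86 and GCC as in the question (the pause is written as inline asm so it doesn't depend on a newer GCC's __builtin_ia32_pause):

volatile int exclusion = 0;

void lock() {
    while (__sync_lock_test_and_set(&exclusion, 1)) {
        // Waste loop: spin with plain reads (no locked bus traffic)
        // until the lock looks free, then retry the atomic exchange.
        while (exclusion)
            __asm__ __volatile__("pause"); // x86 PAUSE: spin-wait hint
    }
}

void unlock() {
    // Release semantics only; on x86 this compiles to a compiler
    // barrier plus a plain store of 0.
    __sync_lock_release(&exclusion);
}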
Note that there was a comment saying that the release barrier wasn't required. That isn't true even on x86, because the release barrier also serves as an instruction to the compiler not to shuffle other memory accesses around the "barrier". Very much like what you'd get if you used asm("" ::: "memory").
On x86, __sync_lock_test_and_set will map to an xchg instruction, which has an implied lock prefix. Definitely the most compact generated code (esp. if you use a byte for the "lock word" instead of an int), but no less correct than if you used LOCK CMPXCHG. Compare-and-swap can be used for fancier algorithms (like putting a non-zero pointer to metadata for the first "waiter" into the lock word on failure).
In response to your questions:
It may be slightly better to use __sync_lock_release() in the unlock() case, as this will release the lock and add the memory barrier in a single operation. However, given your assertion that there will rarely be contention, it looks good to me.
If you're on a recent version of Linux, you may be able to use a futex -- a "fast userspace mutex".
In the uncontested case, which you're trying to optimize for with your spinlock, the futex will behave just like a spinlock, without requiring a kernel syscall. If the lock is contested, the waiting takes place in the kernel without busy-waiting.
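To make that concrete (this isn't from the original answer): a rough sketch of a futex-based lock using the raw futex(2) syscall, essentially the classic three-state design from Drepper's "Futexes Are Tricky". Assumes Linux; state 0 = unlocked, 1 = locked, 2 = locked with waiters.

#define _GNU_SOURCE         // for syscall()
#include <linux/futex.h>    // FUTEX_WAIT, FUTEX_WAKE
#include <sys/syscall.h>    // SYS_futex
#include <unistd.h>         // syscall()

static int lockword = 0;    // 0 = unlocked, 1 = locked, 2 = locked, waiters

static long futex_op(int *uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void lock(void) {
    int c = __sync_val_compare_and_swap(&lockword, 0, 1);
    if (c != 0) {           // slow path: lock was held
        do {
            // Mark the lock as contended (1 -> 2), then sleep in the
            // kernel; FUTEX_WAIT only blocks if the word is still 2.
            if (c == 2 || __sync_val_compare_and_swap(&lockword, 1, 2) != 0)
                futex_op(&lockword, FUTEX_WAIT, 2);
        } while ((c = __sync_val_compare_and_swap(&lockword, 0, 2)) != 0);
    }
}

void unlock(void) {
    // Uncontended fast path (1 -> 0): one atomic op, no syscall.
    if (__sync_fetch_and_sub(&lockword, 1) != 1) {
        lockword = 0;                      // there were waiters: unlock
        futex_op(&lockword, FUTEX_WAKE, 1); // and wake one of them
    }
}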
I wonder if the following CAS implementation is the correct one on x86_64.
It is almost twice as fast on my i7 X920 laptop (Fedora 13 x86_64, GCC 4.4.5).
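The code this answer refers to isn't shown above; a CAS-based spinlock in the same spirit, using the question's GCC builtins, might look like this (a sketch, not necessarily the code that was benchmarked):

volatile int exclusion = 0;

void lock() {
    // Attempt an atomic 0 -> 1 transition; on failure, spin on a
    // plain read until the lock looks free before retrying the CAS.
    while (!__sync_bool_compare_and_swap(&exclusion, 0, 1))
        while (exclusion)
            ;
}

void unlock() {
    __sync_lock_release(&exclusion); // store 0 with release semantics
}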
I can't comment on correctness, but the title of your question raised a red flag before I even read the question body. Synchronization primitives are devilishly hard to get right... if at all possible, you're better off using a well-designed/maintained library, perhaps pthreads or boost::thread.
There are a few wrong assumptions.
First, a spin lock makes sense only if the resource is locked on another CPU. If the resource is locked on the same CPU (which is always the case on uniprocessor systems), you need to yield to the scheduler so the resource can be unlocked. Your current code will work on a uniprocessor system because the scheduler will switch tasks automatically, but it's a waste of resources.
On a multi-processor system the same thing can happen, and a task may migrate from one CPU to another. In short, use of a spin lock is correct only if you guarantee that your tasks will run on different CPUs.
Secondly, locking a mutex IS fast (as fast as a spinlock) when it is unlocked. Mutex locking (and unlocking) is slow (very slow) only if the mutex is already locked.
So, in your case, I suggest using mutexes.
One improvement to suggest is using TATAS (test-and-test-and-set). CAS operations are considered quite expensive for the processor, so it's better to avoid them if possible.
Another thing: make sure you won't suffer from priority inversion (what if a thread with a high priority tries to acquire the lock while a thread with a low priority tries to free it?). On Windows, for example, this issue will ultimately be solved by the scheduler using a priority boost, but you can explicitly give up your thread's time slice if you didn't succeed in acquiring the lock in your last 20 tries (for example).
To make your spinlock implementation more efficient, you can yield to the scheduler whenever the lock acquisition fails.
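For example (assuming a POSIX system; the question only stipulates x86 and GCC):

#include <sched.h> // sched_yield(); assumes POSIX

volatile int exclusion = 0;

void lock() {
    while (__sync_lock_test_and_set(&exclusion, 1))
        sched_yield(); // lock is busy: give up our time slice
}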
Your unlock procedure doesn't need the memory barrier; the assignment to exclusion is atomic as long as it is dword-aligned on x86.
In the specific case of x86 (32/64) I don't think you need a memory fence at all in the unlock code. x86 doesn't do any reordering, except that stores are first put in a store buffer, so their becoming visible can be delayed for other threads. And a thread that does a store and then reads from the same variable will read from its store buffer if it has not yet been flushed to memory. So all you need is an asm statement to prevent compiler reorderings. You run the risk of one thread holding the lock slightly longer than necessary from the perspective of other threads, but if you don't care about contention that shouldn't matter. In fact, pthread_spin_unlock is implemented like that on my system (Linux x86_64).

My system also implements pthread_spin_lock using lock decl lockvar; jne spinloop; instead of xchg (which is what __sync_lock_test_and_set uses), but I don't know if there's actually a performance difference.
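For concreteness, a sketch of the unlock described here, assuming x86 and GCC as in the question; the asm is a pure compiler barrier and emits no instruction:

volatile int exclusion = 0;

void unlock() {
    // Compiler barrier only: stops GCC from moving memory accesses
    // across this point; no CPU fence instruction is emitted.
    __asm__ __volatile__("" ::: "memory");
    exclusion = 0;
}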