x86 spinlock using cmpxchg
I'm new to using gcc inline assembly, and was wondering if, on an x86 multi-core machine, a spinlock (without race conditions) could be implemented as (using AT&T syntax):
    spin_lock:
        mov 0 eax
        lock cmpxchg 1 [lock_addr]
        jnz spin_lock
        ret

    spin_unlock:
        lock mov 0 [lock_addr]
        ret
You have the right idea, but your asm is broken:

cmpxchg can't take an immediate operand; the new value must be in a register.

The lock prefix is not valid on mov. A mov to an aligned address is atomic on x86, so you don't need lock anyway.

It has been some time since I've used AT&T syntax, hope I remembered everything:
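One way the corrected routines might look, written as GCC inline assembly in C rather than as a standalone AT&T listing. This is only a sketch under assumptions: it targets x86 or x86-64 with GCC or Clang, and the names try_lock_once, spin_lock_asm, spin_unlock_asm and lock_word_demo are made up for illustration, not taken from the answer.

```c
#include <assert.h>

/* Hypothetical helper: one attempt to change *lock from 0 to 1.
   Returns nonzero if this caller now owns the lock. */
static inline int try_lock_once(volatile int *lock)
{
    int expected = 0;     /* cmpxchg compares memory against EAX */
    int desired = 1;      /* the new value has to sit in a register */
    unsigned char won;

    __asm__ __volatile__(
        "lock cmpxchgl %3, %1\n\t"   /* if (*lock == eax) *lock = desired */
        "sete %0"                    /* ZF is set when the swap happened */
        : "=q"(won), "+m"(*lock), "+a"(expected)
        : "r"(desired)
        : "memory", "cc");
    return won;
}

static inline void spin_lock_asm(volatile int *lock)
{
    while (!try_lock_once(lock))
        ;   /* retry until the compare-and-swap succeeds */
}

static inline void spin_unlock_asm(volatile int *lock)
{
    /* A plain aligned store is atomic on x86, so no lock prefix. */
    __asm__ __volatile__("movl $0, %0" : "=m"(*lock) : : "memory");
}

/* Single-threaded sanity check of the state transitions. */
static int lock_word_demo(void)
{
    volatile int lock = 0;
    int second_try;

    spin_lock_asm(&lock);
    assert(lock == 1);                 /* held */
    second_try = try_lock_once(&lock); /* must fail while held */
    assert(second_try == 0);
    spin_unlock_asm(&lock);
    return lock;                       /* 0 again after unlock */
}
```

The "memory" clobber keeps the compiler from moving loads and stores of the protected data across the lock and unlock operations.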
Note that GCC has atomic builtins, so you don't actually need to use inline asm to accomplish this:
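A minimal sketch of a spinlock built on GCC's legacy __sync builtins. The function names and the small demo are assumptions for illustration; the answer's own listing is not preserved here.

```c
#include <assert.h>

/* Acquire: atomically change *p from 0 to 1, retrying until it works.
   __sync_bool_compare_and_swap acts as a full barrier. */
static void spin_lock_cas(volatile int *p)
{
    while (!__sync_bool_compare_and_swap(p, 0, 1))
        ;
}

/* Release: __sync_lock_release writes 0 with release semantics. */
static void spin_unlock_cas(volatile int *p)
{
    __sync_lock_release(p);
}

/* Single-threaded walk through lock, critical section, unlock. */
static int builtin_lock_demo(void)
{
    volatile int lock = 0;
    int counter = 0;

    spin_lock_cas(&lock);
    assert(lock == 1);
    counter++;                 /* critical section */
    spin_unlock_cas(&lock);
    assert(counter == 1);
    return lock;               /* 0 once released */
}
```

Newer code would more likely use the __atomic builtins or C11 <stdatomic.h>, which let you state acquire and release ordering explicitly.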
As Bo says below, locked instructions incur a cost: every one you use must acquire exclusive access to the cache line and lock it down while lock cmpxchg runs, like a normal store to that cache line but held for the duration of the lock cmpxchg execution. This can delay the unlocking thread, especially if multiple threads are waiting to take the lock. Even without many CPUs, it's still easy and worthwhile to optimize around this by spinning on a plain read and retrying lock cmpxchg only when the lock looks free.

The pause instruction is vital for performance on HyperThreading CPUs when you've got code that spins like this: it lets the second thread execute while the first thread is spinning. On CPUs which don't support pause, it is treated as a nop.

pause also prevents memory-order mis-speculation when leaving the spin loop, when it's finally time to do real work again. See: What is the purpose of the "PAUSE" instruction in x86?

Note that spin locks are actually rarely used: typically, one uses something like a critical section or futex. These integrate a spin lock for performance under low contention, but then fall back to an OS-assisted sleep and notify mechanism. They may also take measures to improve fairness, and lots of other things the cmpxchg/pause loop doesn't do.

Also note that cmpxchg is unnecessary for a simple spinlock: you can use xchg and then check whether the old value was 0 or not. Doing less work inside the locked instruction may keep the cache line pinned for less time. See "Locks around memory manipulation via inline assembly" for a complete asm implementation using xchg and pause (but still with no fallback to OS-assisted sleep, just spinning indefinitely).
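The xchg-plus-pause idea can be sketched with GCC's __atomic builtins; on x86 an atomic exchange compiles to xchg, which is implicitly locked. The names cpu_relax, spin_lock_xchg, spin_unlock_xchg and xchg_demo are hypothetical, and this is not the implementation from the linked question.

```c
#include <assert.h>

/* Hypothetical helper: emit pause on x86, do nothing elsewhere. */
static inline void cpu_relax(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __builtin_ia32_pause();
#endif
}

static void spin_lock_xchg(int *lock)
{
    /* The exchange returns the previous value; 0 means we took the lock. */
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE) != 0)
        cpu_relax();
}

static void spin_unlock_xchg(int *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}

/* Single-threaded check: a second exchange on a held lock sees 1. */
static int xchg_demo(void)
{
    int lock = 0;
    int old;

    spin_lock_xchg(&lock);
    old = __atomic_exchange_n(&lock, 1, __ATOMIC_ACQUIRE);
    assert(old == 1);          /* already held, so the attempt fails */
    spin_unlock_xchg(&lock);
    return lock;               /* 0 once released */
}
```

Like the loop discussed above, this spins forever under contention; it has no fairness and no fallback to sleeping in the kernel.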
This will put less contention on the memory bus:
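A sketch of that approach, assuming the same __sync builtins as above: after a failed compare-and-swap, wait with ordinary reads and only retry the locked instruction once the lock appears free. The function names are illustrative.

```c
#include <assert.h>

static void spin_lock_ttas(volatile int *p)
{
    while (!__sync_bool_compare_and_swap(p, 0, 1)) {
        /* Read-only wait: no locked instruction is issued while
           another thread still holds the lock. */
        while (*p)
            ;
    }
}

static void spin_unlock_ttas(volatile int *p)
{
    __sync_lock_release(p);    /* store 0 with release semantics */
}

/* Single-threaded check of the lock word before and after. */
static int ttas_demo(void)
{
    volatile int lock = 0;

    spin_lock_ttas(&lock);
    assert(lock == 1);
    spin_unlock_ttas(&lock);
    return lock;
}
```

The volatile qualifier matters here: it forces the inner loop to reload the lock word on every iteration instead of spinning on a cached register value.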
The syntax is wrong. It works after a little modification.
To make the code run faster, assume lock_addr is stored in the %rdi register.

Use movl and test instead of lock cmpxchgl %ecx, (%rdi) to spin.

Use lock cmpxchgl %ecx, (%rdi) to try to enter the critical section only when there is a chance of success.

This avoids unneeded bus locking.
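The steps above might be expressed as GCC inline assembly along these lines. It is a sketch, not the answer's original listing: the compiler chooses the address register rather than it being fixed to %rdi, and the function names are assumptions. It requires x86 or x86-64.

```c
#include <assert.h>

static inline void spin_lock_test_first(volatile int *lock)
{
    int observed;            /* lives in EAX, as cmpxchg requires */
    const int desired = 1;

    __asm__ __volatile__(
        "1:\n\t"
        "movl %1, %0\n\t"            /* plain load of the lock word */
        "testl %0, %0\n\t"
        "jnz 1b\n\t"                 /* still held: keep reading */
        "lock cmpxchgl %2, %1\n\t"   /* looks free: try to take it */
        "jnz 1b"                     /* lost the race: spin again */
        : "=&a"(observed), "+m"(*lock)
        : "r"(desired)
        : "memory", "cc");
}

static inline void spin_unlock_plain(volatile int *lock)
{
    /* An aligned movl store is atomic on x86. */
    __asm__ __volatile__("movl $0, %0" : "=m"(*lock) : : "memory");
}

/* Single-threaded check of the lock word before and after. */
static int test_first_demo(void)
{
    volatile int lock = 0;

    spin_lock_test_first(&lock);
    assert(lock == 1);
    spin_unlock_plain(&lock);
    return lock;
}
```

The early-clobber on the EAX output keeps the compiler from placing the new value or the lock's address in the register that the load overwrites.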
I have tested it using pthread and a simple loop like this.
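A test of this kind could be set up roughly as follows. The thread count, the iteration count, and all names are assumptions, since the original harness is not shown; the lock is a stand-in built from GCC builtins so the block is self-contained.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define ITERATIONS 100000L     /* assumed; the original count is not given */

static volatile int mutex;     /* 0 = free, 1 = held */
static long count;             /* only touched while holding mutex */

static void spin_lock(volatile int *p)
{
    while (!__sync_bool_compare_and_swap(p, 0, 1))
        while (*p)
            ;                  /* read-only wait until it looks free */
}

static void spin_unlock(volatile int *p)
{
    __sync_lock_release(p);
}

static void *worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERATIONS; ++i) {
        spin_lock(&mutex);
        ++count;               /* critical section */
        spin_unlock(&mutex);
    }
    return NULL;
}

/* Runs two workers; returns the final counter, or -1 if a thread
   could not be created. */
static long run_counter_test(void)
{
    pthread_t t[2];
    int started = 0;

    mutex = 0;
    count = 0;
    for (; started < 2; ++started)
        if (pthread_create(&t[started], NULL, worker, NULL) != 0)
            break;
    for (int i = 0; i < started; ++i)
        pthread_join(t[i], NULL);
    assert(mutex == 0);        /* every lock was paired with an unlock */
    return started == 2 ? count : -1;
}
```

If the lock is correct, the counter ends at exactly 2 * ITERATIONS; timing the run with each lock variant swapped in gives the kind of comparison described here, though the numbers depend heavily on the machine.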
In my test, the first one takes 2.5 to 3 seconds and the second one takes 1.3 to 1.8 seconds.