x86 spin-lock using cmpxchg

Posted on 2024-11-27


I'm new to using gcc inline assembly, and was wondering if, on an x86 multi-core machine, a spinlock (without race conditions) could be implemented as (using AT&T syntax):

spin_lock:
mov 0 eax
lock cmpxchg 1 [lock_addr]
jnz spin_lock
ret

spin_unlock:
lock mov 0 [lock_addr]
ret

3 Answers

清风疏影 2024-12-04 13:43:43


You have the right idea, but your asm is broken:

cmpxchg can't work with an immediate operand, only registers.

lock is not a valid prefix for mov. mov to an aligned address is atomic on x86, so you don't need lock anyway.

It has been some time since I've used AT&T syntax, hope I remembered everything:

spin_lock:
    xorl   %ecx, %ecx
    incl   %ecx            # newVal = 1
spin_lock_retry:
    xorl   %eax, %eax      # expected = 0
    lock; cmpxchgl %ecx, (lock_addr)
    jnz    spin_lock_retry
    ret

spin_unlock:
    movl   $0,  (lock_addr)    # atomic release-store
    ret

Note that GCC has atomic builtins, so you don't actually need to use inline asm to accomplish this:

void spin_lock(int *p)
{
    while(!__sync_bool_compare_and_swap(p, 0, 1));
}

void spin_unlock(int volatile *p)
{
    asm volatile ("":::"memory"); // acts as a memory barrier.
    *p = 0;
}

As Bo says below, locked instructions incur a cost: every one you use must acquire exclusive access to the cache line and lock it down while lock cmpxchg runs, like for a normal store to that cache line but held for the duration of lock cmpxchg execution. This can delay the unlocking thread especially if multiple threads are waiting to take the lock. Even without many CPUs, it's still easy and worth it to optimize around:

void spin_lock(int volatile *p)
{
    while(!__sync_bool_compare_and_swap(p, 0, 1))
    {
        // spin read-only until a cmpxchg might succeed
        while(*p) _mm_pause();  // or maybe do{}while(*p) to pause first
    }
}

The pause instruction is vital for performance on HyperThreading CPUs when you've got code that spins like this -- it lets the second thread execute while the first thread is spinning. On CPUs which don't support pause, it is treated as a nop.

pause also prevents memory-order mis-speculation when leaving the spin-loop, when it's finally time to do real work again. What is the purpose of the "PAUSE" instruction in x86?

Note that spin locks are actually rarely used: typically, one uses something like a critical section or futex. These integrate a spin lock for performance under low contention, but then fall back to an OS-assisted sleep and notify mechanism. They may also take measures to improve fairness, and lots of other things the cmpxchg / pause loop doesn't do.


Also note that cmpxchg is unnecessary for a simple spinlock: you can use xchg and then check whether the old value was 0 or not. Doing less work inside the locked instruction may keep the cache line pinned for less time. See Locks around memory manipulation via inline assembly for a complete asm implementation using xchg and pause (but still with no fallback to OS-assisted sleep, just spinning indefinitely.)

臻嫒无言 2024-12-04 13:43:43


This will put less contention on the memory bus:

void spin_lock(int *p)
{
    while(!__sync_bool_compare_and_swap(p, 0, 1)) while(*p);
}
狼性发作 2024-12-04 13:43:43


The syntax is wrong. It works after a little modification.

spin_lock:
    movl $0, %eax
    movl $1, %ecx
    lock cmpxchg %ecx, (lock_addr)
    jnz spin_lock
    ret
spin_unlock:
    movl $0, (lock_addr)
    ret

To make it faster, assume lock_addr is stored in the %rdi register.

Spin using movl and test instead of lock cmpxchgl %ecx, (%rdi).

Attempt to enter the critical section with lock cmpxchgl %ecx, (%rdi) only when there is a chance it will succeed.

This avoids unneeded bus locking.

spin_lock:
    movl $1, %ecx
loop:
    movl (%rdi), %eax
    test %eax, %eax
    jnz loop
    lock cmpxchgl %ecx, (%rdi)
    jnz loop
    ret
spin_unlock:
    movl $0, (%rdi)
    ret

I have tested it using pthreads and a simple loop like this.

for(i = 0; i < 10000000; ++i){
    spin_lock(&mutex);
    ++count;
    spin_unlock(&mutex);
}

In my test, the first version takes 2.5–3 seconds and the second takes 1.3–1.8 seconds.
