为什么 CompareAndSwap 指令被认为是昂贵的?

发布于 2024-09-03 19:22:55 字数 107 浏览 6 评论 0原文

为什么 CompareAndSwap 指令被认为是昂贵的?

我在一本书上读到:

“内存屏障很昂贵,大约为 作为原子的比较昂贵() 指导。”

谢谢!

Why is CompareAndSwap instruction considered expensive?

I read in a book:

"Memory barriers are expensive, about as
expensive as an atomic compareAndSet()
instruction."

Thanks!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

仅此而已 2024-09-10 19:22:55

“CAS 与普通存储没有明显不同。有关 CAS 的一些错误信息可能源于 Intel 处理器上 lock:cmpxchg (CAS) 的原始实现。lock: 前缀导致 LOCK# 信号被断言,这当然无法扩展 lock:cmpxchg 的后续实现(通常是基于侦听的 MESI),并且不会断言 LOCK#。” - David。 Dice,HotSpot 中的偏向锁定

“内存屏障非常昂贵,大约与原子compareAndSet()指令一样昂贵。”

这是千真万确的。
例如,在 x86 上,多处理器系统上的正确 CAS 具有锁定前缀。
锁前缀会导致完整的内存屏障:

“...锁定操作序列化所有未完成的加载和存储操作(即等待它们完成)。” ...“锁定操作相对于所有其他内存操作和所有外部可见事件而言是原子的。只有取指和页表访问才能传递锁定指令。锁定指令可用于同步一个处理器写入的数据和另一处理器读取的数据”。 - 英特尔® 64 和 IA-32 架构软件开发人员手册,章节8.1.2.

实际上,内存屏障在 .NETx86/x64 上的 JAVA JIT
在 x86 上,CAS 会导致完全内存屏障。

在 PPC 上,情况有所不同。 LL/SC 对 - lwarx< /代码> & stwcx- 可用于将内存操作数加载到寄存器中,然后如果没有其他存储到目标位置,则将其写回,或者如果有则重试整个循环。 LL/SC 可以被中断。
它也不意味着自动完整围栏。
不同架构上的性能特征和行为可能有很大不同。
但话又说回来 - 弱 LL/SC 是不是CAS。

"CAS isn't appreciably different than a normal store. Some of the misinformation regarding CAS probably arises from the original implementation of lock:cmpxchg (CAS) on Intel processors. The lock: prefix caused the LOCK# signal to be asserted, acquiring exclusive access to the bus. This didn't scale of course. Subsequent implementations of lock:cmpxchg leverage cache coherency protocol -- typically snoop-based MESI -- and don't assert LOCK#." - David Dice, Biased locking in HotSpot

"Memory barriers are expensive, about as expensive as an atomic compareAndSet() instruction."

This is quite true.
E.g. on x86, a proper CAS on a multi-processor system has a lock prefix.
The lock prefix results in a full memory barrier:

"...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor." - Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8.1.2.

A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the JAVA JIT on x86/x64.
On x86, CAS results in a full memory barrier.

On PPC, it is different. An LL/SC pair - lwarx & stwcx - can be used to load the memory operand into a register, then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC can be interrupted.
It also does not mean an automatic full fence.
Performance characteristics and behaviour can be very different on different architectures.
But then again - a weak LL/SC is not CAS.

挖个坑埋了你 2024-09-10 19:22:55

这是因为它们引入了额外的开销来使操作原子化。底层平台必须抑制优化(例如缓存)并暂停线程执行以促进屏障,这需要大量额外的工作。当额外的活动正在进行时,线程无法执行,因此整个程序会产生时间延迟。

That's because they introduce extra overhead for making the operation atomic. The underlying platform will have to suppress optimizations (like caching) and suspend threads execution for facilitating the barrier and that requires lots of extra effort. While that extra activity is in progress threads can't execute and so the overall program incurs a time delay.

烟─花易冷 2024-09-10 19:22:55

“贵”在这里是非常相对的。与硬盘访问相比,这绝对是微不足道的。但RAM总线速度还没有跟上现代CPU的速度,并且与CPU内部的算术运算相比,直接访问RAM(即非缓存)是相当昂贵的。从 RAM 中获取 int 的时间很容易比添加两个寄存器的时间长 50 倍。

因此,由于内存屏障基本上强制直接 RAM 访问(可能针对多个 CPU),因此它们相对昂贵。

"expensive" is very relative here. It's absolutely insignificant compared with, say, a harddisk access. But RAM bus speed has not kept up with the speed of modern CPUs, and compared with arithmetic operations inside the CPU, accessing the RAM directly (i.e. non-cached) is quite expensive. It can easily take 50 times as long to fetch an int from RAM than to add two registers.

So, since memory barriers basically force direct RAM access (possibly for multiple CPUs), they are relatively expensive.

吃不饱 2024-09-10 19:22:55

我想我在我的书中找到了答案:

每个 getAndSet() 都会广播到总线。因为
所有线程必须使用总线进行通信
内存,这些 getAndSet() 调用会延迟所有线程(核心),
即使那些不等待锁的人。

更糟糕的是, getAndSet() 调用会强制其他
处理器丢弃自己的缓存副本
锁,所以每个旋转线程都会遇到一个
缓存几乎每次都未命中,并且必须使用
总线来获取新的但未更改的值。

I think I found the answer in my book:

Each getAndSet() is broadcast to the bus. because
all threads must use the bus to communicate with
memory, these getAndSet() calls delay all threads (cores),
even those not waiting for the lock.

Even worse, the getAndSet() call forces other
processors to discard their own cached copies of
the lock, so every spinning thread encounters a
cache miss almost every time, and must use the
bus to fetch the new, but unchanged value.

暮年慕年 2024-09-10 19:22:55

一般来说,原子操作是昂贵的,因为它们需要跨CPU同步。允许“正常”操作对缓存数据进行操作,从而提高速度。以 2 cpu 系统为例:

线程 1

while (1) x++;

线程 2

while (1) x++;

因为增量不是原子操作,也不是受内存屏障保护,所以其结果几乎是未定义的。您不知道 x 将如何递增,或者甚至可能被损坏。

线程 1

while (1) atomicIncrement(&x);

线程 2

while (1) atomicIncrement(&x);

现在,您正在尝试获得明确定义的行为 - 无论顺序如何,x 都必须一一递增。如果两个线程在不同的 CPU 上运行,它们必须减少允许的缓存量,或者以其他方式“比较笔记”以确保发生合理的事情。

这种额外的开销可能非常昂贵,也是原子操作速度缓慢的主要原因。

In general, atomic operations are expensive because they require cross-CPU synchronization. A "normal" operation is allowed to operate on cached data, allowing extra speed. Take for example, on a 2 cpu system:

Thread 1

while (1) x++;

Thread 2

while (1) x++;

Because increment is not an atomic operation or protected by a memory barrier, the results of this are pretty much undefined. You don't know how x will be incremented, or it could even get corrupted.

Thread 1

while (1) atomicIncrement(&x);

Thread 2

while (1) atomicIncrement(&x);

Now, you are trying to get well defined behavior - no matter the ordering, x must increment one by one. If the two threads are running on different CPUs, they have to either reduce the amount of allowed caching or otherwise "compare notes" to make sure that something sensible happens.

This extra overhead can be quite expensive, and it's the general cause of the claim that atomic operations are slow.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文