What are the in-CPU characteristics of CAS collision?
I'm trying to understand the low-level mechanics of CAS on x86/x64 and I'd really appreciate some help/insight.
The reason I've been thinking about this is that I'm trying to reason about exponential backoff, and figure out in principle what the correct single unit of backoff delay should be.
If I look at a lock-free freelist benchmark without exponential backoff, I see that as the number of threads increases, performance rapidly flatlines.
Release 7 Lock-Free Freelist Benchmark #1

   M
   N
   S
  L3U
L2U L2U
L1D L1D
L1I L1I
 P   P
L L L L total ops,mean ops/sec per thread,standard deviation,scalability
0 0 0 1 310134488,31013449,0,1.00
0 1 0 1 136313300,6815665,38365,0.22
0 1 0 1 136401284,6820064,50706,0.22
1 1 1 1 111134328,2778358,23851,0.09
0 0 1 1 334747444,16737372,2421,0.54
1 1 1 1 111105898,2777647,40399,0.09
As we know, live-lock can occur, where each thread prevents the others from progressing.
My original - and I believe now mistaken - thought was that CAS was interfering with CAS. By that I mean, the CAS instruction itself would destructively collide with another CAS, should they occur concurrently. Both would fail. (Probably because in the back of my mind I was thinking about Ethernet.)
This 'obviously' explains the results - all those CAS instructions operating concurrently, very few have a chance to fully execute before being destructively interrupted.
Having thought about it some more, I believe now this cannot be the case. The CAS instruction does not in fact HAVE a failure mode. It will tell you the destination is equal or not equal to the comparand. That's all. It doesn't come back and say "oh, sorry, bumped into someone else".
Destructive interference IS occurring, but it's occurring at a higher level, in the data structure algorithm itself. When we push to or pop from the freelist, we are actually TRYING to swap. We need the destination to be stable for long enough that we can read it, do whatever work we need to do, and then find it unchanged so we can complete our push/pop.
If other threads keep CASing, destination isn't stable - it keeps changing - and we keep having to retry our operation.
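To make "retry at the data structure level" concrete, here is a minimal sketch of a lock-free freelist push using C11 atomics. This is illustrative only, not the benchmark's actual code; the names node and freelist_head are made up for the example.

    #include <stdatomic.h>

    struct node {
        struct node *next;
    };

    /* Hypothetical head pointer of the freelist. */
    static struct node *_Atomic freelist_head;

    void freelist_push(struct node *n)
    {
        struct node *old_head = atomic_load(&freelist_head);

        do {
            /* Link our node to the head value we observed. */
            n->next = old_head;

            /* The CAS itself never "fails because of a collision": it just
             * reports whether freelist_head still equals old_head.  If it
             * doesn't - typically because another thread pushed or popped
             * in the meantime - compare_exchange reloads old_head for us
             * and we go round again.  This software loop is where the
             * destructive interference lives, not inside the instruction. */
        } while (!atomic_compare_exchange_weak(&freelist_head, &old_head, n));
    }

Every trip round that do/while which doesn't take the exit is one wasted 'slot' in the sense described above.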
But now I'm confused.
What we see is that a single thread performs about 30 million push/pop operations per second. The destination has to be stable for the duration of one of these operations for the operation to succeed, so we see there are 30 million 'slots' per second. If we have two threads, then the maximum theoretical performance we can have is 15 million operations per second per thread; each thread using half the slots.
Now let's come back to CAS. CAS has no failure mode. So what's happening when a second thread tries to CAS while another thread is already CASing? Well, the second thread will fail at the data structure level, since the swap couldn't occur, so it will retry the swap.
But now imagine we have LOTS of threads. The first thread to begin a CAS will succeed (assuming each CAS takes exactly the same time - not true, but that assumption doesn't change anything fundamental, so it's okay to reason with). All the others will fail.
But once the first thread has finished, the very next thread that reads the new destination value will have its CAS succeed (and all the other threads, still executing their CASes or now beginning new CASes, will fail).
So why do we not see perfect scaling? Because every 'slot' should be being used!
I think therefore I do not understand CAS properly.
Reading Intel's Architecture Software Developer's Manual, I find that if all data is present in cache (which is the situation I'm interested in), the cache coherency protocol takes care of CAS.
Drepper in his white paper describes LL/SC and how it works using MESI.
It seems reasonable to me for CAS to operate in a similar manner.
Let's consider the two thread case. First thread begins its CAS. The cache line with the destination is in its cache and marked exclusive.
Second thread begins to CAS. The first core sends its cache line over to the second core and both cores have that cache line marked shared.
First thread completes the CAS and writes to the cache line (the write always occurs on x86/x64, even if the compare was false; it just writes the original value).
The act of writing marks the cache line as modified; an RFO occurs, causing the second core to mark its cache line as invalid.
The second thread comes to complete its CAS and notices its cache line is invalid... and then what? I find it hard to believe the instruction internally loops inside the CPU until it succeeds - although I wonder, because LL/SC on ARM requires you to write that loop yourself in assembly. But the CAS instruction knows the value of the destination has changed, so the result of its compare is invalid. Still, there's no error possible with CAS; it always returns true or false for the compare. But even if the instruction does loop until complete, I'd still expect perfect scaling. Each 'slot' should still be used.
So what happens? What is happening with CAS?
What I do see is that as the thread count rises, less and less work is done - all the available 'slots' are certainly not being used. Something is causing this. Is it destructive interference between CAS instructions? Or is it a large number of RFOs hogging the CPU-to-northbridge bus?
What I do note with considerable interest is that two threads on the same physical core scale perfectly. Something special and different is happening in that case - two threads on separate physical cores scale half as well. But it's not enough of a clue to explain it all.
3 Answers
What you're seeing here is the cost of moving the data between the L1 caches of the two physical cores. When only using one core, the data sits in that L1 cache and each CAS operates at full speed with the data in the cache. When there are two cores active, on the other hand, each time a core succeeds in writing to the data it invalidates the other core's copy, which means the data has to be copied between the caches before the other core can do anything (generally, that core will block waiting for the load that precedes the CAS to complete). This is much more expensive than the actual CAS (the data needs to move up to at least the L3 cache and then back down to the other L1 cache), and it results in the slowdown you see, as the data ends up ping-ponging back and forth between the two L1 caches.
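For what it's worth, here is a minimal sketch of how one could observe this directly (assumptions: Linux, GCC, compile with -pthread; the CPU numbers 0 and 2 are placeholders you would replace with a sibling pair or a cross-core pair from your own topology):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_ulong target;   /* the contended cache line */
    static atomic_int  stop;      /* set to 1 to end the run  */

    struct arg { int cpu; unsigned long attempts; };

    static void *worker(void *p)
    {
        struct arg *a = p;

        /* Pin this thread to the requested logical CPU. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(a->cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);

        unsigned long expected = 0;
        while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
            /* Every attempt needs exclusive ownership of the line (an RFO
             * if the other core had it), whether or not the compare wins. */
            atomic_compare_exchange_weak(&target, &expected, expected + 1);
            a->attempts++;
        }
        return NULL;
    }

    int main(void)
    {
        struct arg a0 = { .cpu = 0 }, a1 = { .cpu = 2 };
        pthread_t t0, t1;

        pthread_create(&t0, NULL, worker, &a0);
        pthread_create(&t1, NULL, worker, &a1);
        sleep(5);
        atomic_store(&stop, 1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        printf("cpu %d: %lu CAS attempts\ncpu %d: %lu CAS attempts\n",
               a0.cpu, a0.attempts, a1.cpu, a1.attempts);
        return 0;
    }

Running the pair on two SMT siblings versus two separate physical cores should reproduce the same kind of gap as the benchmark above: the attempt counts collapse when the line has to ping-pong between L1 caches.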
By CAS, I assume you're talking about LOCK CMPXCHG.
You seem to think the operation starts, gets interrupted, and continues. CAS is atomic with respect to the memory subsystem, so it reads the value, compares and writes in a single go. There is no window in which it can lose the cache line to another core once it has acquired it. How does that work? It asserts a processor lock signal during the instruction's execution, so that other instructions stall on the memory subsystem until the cache line is available again. That's why there is a LOCK prefix on the CMPXCHG instruction. You can read the LOCK description for further details.
So most of the contention happens at the L1, trying to get exclusive ownership of the cache line while that signal is being asserted most of the time. If the L1 already has the cache line (as in the case of 2 threads on the same core), the only contention is over the duration of the CAS itself, not including the cache line transfer across cores (since it's already there). And that's a lot faster.
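For anyone mapping this back to the manual, the effect of LOCK CMPXCHG can be sketched in C-like pseudocode as follows (the function name and parameters are just for illustration; the real instruction uses RAX/EAX as the accumulator and reports the comparison in ZF):

    /* Sketch of LOCK CMPXCHG r/m, r semantics: with the LOCK prefix the
     * whole body executes as one atomic step against the memory subsystem. */
    int lock_cmpxchg(unsigned long *dest, unsigned long *accumulator,
                     unsigned long src)
    {
        /* --- atomic region: cache line held exclusively --- */
        if (*dest == *accumulator) {
            *dest = src;           /* equal: store the new value (ZF = 1) */
            return 1;
        } else {
            *accumulator = *dest;  /* not equal: the accumulator receives the
                                      current value (ZF = 0); dest is written
                                      back unchanged */
            return 0;
        }
        /* --- end of atomic region --- */
    }

There is no third outcome: whatever another core did, the instruction ends up in one of those two branches, which is why any retrying has to happen in software, as in the push loop shown earlier in this thread.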
So, I've been thinking all this over.
Currently, I have two separate proposals for how CAS is handled - 'cache lock' and MESI.
This post is purely with regard to cache lock.
Cache lock posits that a core locks the given cache line, and other cores attempting to CAS on that cache line stall until the cache line is released.
Furthermore, I believe also that CAS always writes its result back to memory before completing.
Taking that theory, let's look at the benchmark and try to interpret the results.
So, first the single thread case;
Here we have maximum performance. Every 'slot' is used by the single thread.
Now we come to two threads on the same core;
Here we still of course have the same number of 'slots' - a CAS takes as long as it takes - but we see they're evenly distributed between the logical processors. This makes sense: one logical processor locks the cache line, the other stalls, the first completes, the second gets the lock... they alternate. The destination remains in the L1 cache, with the cache line in the modified state; we never need to re-read the destination from memory, so in that sense we're just like the one thread case.
Now we come to two threads on different cores;
Here we see our first big slowdown. Our maximum theoretical scaling is 0.5, but we're at 0.22. How come? Well, each thread is trying to lock the same cache line (in its own cache, of course), which is fine - but the problem is that when a core gets the lock, it will need to re-read the destination from memory, because its cache line will have been marked invalid by the other core having modified its copy of the data. So we put the slowdown down to the memory reads we're having to do.
Now we come to four threads, two per core.
Here we see the total number of ops is actually only slightly less than one thread per core, although of course the scaling is much worse, since we now have four threads, not two.
In the one-thread-per-core scenario, every CAS begins with a read of memory, since the other core has invalidated the CASing core's cache line.
In this scenario, when a core finishes a CAS and releases the cache lock, three threads are competing for the lock, two on another core, one on the same core. So two thirds of the time we need to re-read memory at the start of CAS; one third of the time we do not.
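Written out as a rough expected-cost comparison (c and t here are assumed stand-ins for the duration of the CAS itself and of a cross-core cache line transfer; neither is measured):

    % Expected cost per completed operation under the cache-lock model.
    % c = duration of the CAS itself, t = cross-core cache line transfer time.
    \[ \text{two threads, one per core:} \qquad E[\text{cost}] = c + t \]
    \[ \text{four threads, two per core:} \qquad E[\text{cost}] = c + \tfrac{2}{3}\,t \]
    % Since c + (2/3)t < c + t, the model predicts a lower cost per
    % operation, i.e. more total operations, with four threads.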
So we should be FASTER. But we are in fact SLOWER.
And this puzzles me. The observed facts do not fit the theory.