Do atomic operations become slower as more CPUs are added?
x86 and other architectures provide special atomic instructions (lock, cmpxchg, etc.) that allow you to write 'lock free' data structures. But as more and more cores are added, it seems as though the work these instructions will actually have to do behind the scenes will grow (at least to maintain cache coherency?). If an atomic add takes ~100 cycles today on a dual core system, might it take significantly longer on the 80+ core machines of the future? If you're writing code to last, might it actually be a better idea to use locks even if they're slower today?
4 Answers
You are right that topology constraints will, one way or another, increase latency of communication between cores, once the counts start going higher than a couple dozen. I don't really know what the intentions are of the x86 companies for dealing with that sort of scaling.
But locks are implemented in terms of atomic operations. So you don't really win by trying to switch to them, unless they are implemented in a more scalable way than what you would attempt with your own hand-rolled atomic operations. I think that generally, for single-token-like contention, atomic primitives will still be the fastest way, regardless of how many cores you have.
As Cray discovered a long time ago, there's no free lunch here. High-level software design, where you try to use potentially contentious resources as infrequently as possible, will always lead to the biggest payoff in massively parallel applications. This means doing as much work as possible as the result of a lock acquisition, but also as quickly as possible. In extreme situations, this can mean pre-computing your work on the assumption of a successfully acquired lock, trying to grab it, completing as fast as possible on success, and otherwise throwing away your work and retrying on failure.
For the question posed in the title, the short answer is "yes," the long answer is "it is complicated."
With regards to locks being better, no. Internally a lock has to push at least as much (if not more) traffic over the bus. Think about it this way: if the processor only had one atomic operation, an atomic compare-and-swap, you could use it to implement both locks and atomic increments. At the bus-protocol level there are only a few primitives in use. Locks are not slower than atomic operations because they are doing something different; they are slower because they are doing more of the same thing (from a coherency standpoint). So as atomic operations slow down, locks will tend to slow down comparably.
Having said that, there are lots and lots of papers on the subject and particular cases are complicated. I wouldn't worry about how your code is going to scale on 80-core CPUs that have unpredictable performance characteristics (because we don't know how they will be designed). Either they'll behave like our current CPUs and your code will perform fine, or they won't and whatever you guessed now will turn out to have been wrong. In most cases it will turn out the code wasn't performance sensitive anyway, so it doesn't matter, but if it does then the appropriate thing to do will be to fix it in the future, when you understand the architectural and performance characteristics of your target processors.
I don't think the problem is that atomic operations will take longer themselves; the real problem might be that an atomic operation might block bus operations on other processors (even if they perform non-atomic operations).
If you want to write code to last, try to avoid locking in the first place.
On a side note to this question, it is worth mentioning that the future you refer to is already present technology in GPUs. A modern Quadro GPU has as many as 256 cores and can perform atomic operations on the global (display) memory.
I'm not sure how this is achieved but the fact is that it's already happening.