Performance gain problem in a multicore application
I have a serial (non-parallel) application written in C. I have modified and rewritten it using Intel Threading Building Blocks. When I run this parallel version on a quad-core AMD Phenom II machine, I get a performance gain of more than 4x, which conflicts with Amdahl's law. Can anyone give me a reason why this is happening?
Thanks,
Rakesh.
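The question doesn't include the code, but a typical serial-to-TBB conversion looks something like the sketch below; sum_of_squares is a hypothetical stand-in for the real workload, assuming it is a simple reduction loop:

    // Minimal sketch of a serial-to-TBB rewrite, assuming a reduction loop;
    // the question's actual code is not shown, so sum_of_squares is a
    // hypothetical stand-in.
    #include <tbb/parallel_reduce.h>
    #include <tbb/blocked_range.h>
    #include <cstddef>

    double sum_of_squares(const double* a, std::size_t n) {
        // Serial form: for (i = 0; i < n; ++i) total += a[i] * a[i];
        return tbb::parallel_reduce(
            tbb::blocked_range<std::size_t>(0, n),
            0.0,
            [a](const tbb::blocked_range<std::size_t>& r, double acc) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    acc += a[i] * a[i];               // each task sums its chunk
                return acc;
            },
            [](double x, double y) { return x + y; }  // combine partial sums
        );
    }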
3 Answers
If you rewrite the program, you could make it more efficient. Amdahl's law only limits the amount of speedup due to parallelism, not how much faster you can make your code by improving it.
You could be realizing the effects of having 4x the cache, since now you can use all four cores' caches. Or there may be less contention with other processes running on your machine. Or you accidentally fixed a mispredicted branch.
TL;DR: it happens.
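For reference, the bound this answer appeals to: if a fraction P of the runtime can be parallelized across n cores, Amdahl's law gives

    speedup S(n) = 1 / ((1 - P) + P/n),  which is at most min(n, 1/(1 - P))

so even a perfectly parallel program (P = 1) tops out at S(4) = 4 on four cores; anything beyond that has to come from effects other than parallelism, such as the cache and branch effects listed above.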
It's known as "superlinear speedup", and it can occur for a variety of reasons, though the most common root cause is probably cache behaviour. Usually when superlinear speedup occurs, it's a clue that you could make the sequential version more efficient.
For example, suppose you have a processor where some of the cores share an L2 cache (a common architecture these days), and suppose your algorithm makes multiple traversals of a large data structure. If you perform the traversals in sequence, then each traversal has to pull the data into the L2 cache afresh, whereas if you perform the traversals in parallel you may well avoid a large number of those misses, as long as the traversals run in step (getting out of step is a good source of unpredictable performance here). To make the sequential version more efficient you could interleave the traversals, thereby improving locality, as the sketch below illustrates.
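To make the interleaving suggestion concrete, here is a hypothetical fused-traversal sketch; the answer names no specific code, so the scale-then-add passes are just an illustration:

    #include <cstddef>

    // Two separate traversals: each pass streams the whole array through
    // the cache, so if n is large the second pass takes all its misses again.
    void two_passes(float* a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) a[i] *= 2.0f;   // traversal 1
        for (std::size_t i = 0; i < n; ++i) a[i] += 1.0f;   // traversal 2
    }

    // Interleaved (fused) traversal: each element is loaded into cache once
    // and both operations are applied while it is still resident.
    void fused_pass(float* a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) a[i] = a[i] * 2.0f + 1.0f;
    }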
In a word, caches.
Each core has its own L1 cache, so simply by using more cores you have increased the amount of cache in play, which in turn brings more of your data closer to where it will be processed. That alone can improve performance significantly (as if you had a bigger cache on a single core). Combined with the near-linear speedup from effective parallelization, you can see superlinear performance improvements overall.
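As a rough illustration (these numbers are hypothetical, not measurements from the question's machine): suppose the hot working set is 1.5 MB and each core has 512 KB of private cache. On one core the data cannot fit, so the loop keeps missing to the next cache level or to memory. Split across four cores, each thread's share is roughly 1.5 MB / 4 ≈ 384 KB, which does fit in its core's private cache, so each thread also runs faster per element, on top of the roughly 4x from parallelism; the product of the two effects is the superlinear result.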