HyperThreading performance for floating-point intensive code on recent Xeons
We have recently purchased a dual Intel X5650 workstation to run a floating-point intensive simulation, under Ubuntu 10.04.
Each X5650 has 6 cores, so there are 12 cores in total. The code is trivially parallel, so I have been running it mostly with 12 threads, and observing approximately "1200%" processor utilization through "top".
HyperThreading is enabled in the BIOS, so the operating system nominally sees 24 cores available. If I increase the number of threads to 24, top reports approximately 2000% processor utilization - however, the actual code performance does not appear to improve by a factor of 20/12.
My question is - how does HyperThreading actually work on the latest generation of Xeons? Would a floating-point intensive code benefit from scheduling more than one thread per core? Does the answer change if the working set is on the order of the cache size, as compared to several times larger, or if there are substantial I/O operations (e.g. writing simulation outputs to disk)?
Additionally - how should I interpret processor utilization percentages from "top" when hyperthreading is enabled?
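One way to make the 12-thread and 24-thread runs directly comparable is to pin each thread to a specific logical CPU, so the scheduler cannot co-schedule two of the 12 threads on sibling hyperthreads of one core. Below is a minimal sketch using pthread affinity (compile with g++ -pthread); the assumption that logical CPUs 0-11 are the 12 distinct physical cores is BIOS/kernel dependent and should be checked against the /sys topology files first.

    // Minimal sketch: pin one worker per physical core, so a 12-thread run
    // cannot land two threads on sibling hyperthreads of the same core.
    // Assumption: logical CPUs 0-11 map to the 12 distinct physical cores
    // on this box; the numbering is BIOS/kernel dependent, so verify it
    // (e.g. via /sys/devices/system/cpu/*/topology) before relying on it.
    #include <pthread.h>
    #include <sched.h>
    #include <vector>

    static void *worker(void *arg) {
        long cpu = (long)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        // ... run the FP-intensive kernel here ...
        return 0;
    }

    int main() {
        const long nthreads = 12;               // one per physical core
        std::vector<pthread_t> tids(nthreads);
        for (long i = 0; i < nthreads; ++i)
            pthread_create(&tids[i], 0, worker, (void *)i);
        for (long i = 0; i < nthreads; ++i)
            pthread_join(tids[i], 0);
        return 0;
    }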
2 Answers
With HT, the OS will schedule 2 threads to each core at the same time. The utilization reported by top is essentially just the average number of threads in the "running" state over its sampling interval (typically 1 second). Running threads are available for the CPU to execute, but may not be getting much work done, e.g. if they're mostly stalled on cache misses.
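As a concrete way to see which logical CPUs share a physical core (and therefore what a given top percentage actually covers), Linux exports a sibling map in sysfs; here is a minimal sketch that prints it, using the sysfs layout present on the 2.6.x kernels Ubuntu 10.04 ships:

    // Print which logical CPUs share a physical core, to check what a
    // given top percentage actually covers.
    #include <cstdio>
    #include <fstream>
    #include <string>

    int main() {
        for (int cpu = 0; ; ++cpu) {
            char path[128];
            snprintf(path, sizeof(path),
                "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                cpu);
            std::ifstream f(path);
            if (!f) break;                   // ran out of logical CPUs
            std::string siblings;
            std::getline(f, siblings);
            printf("cpu%d shares a core with logical CPUs: %s\n",
                   cpu, siblings.c_str());
        }
        return 0;
    }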
When a thread is blocked on real I/O -- network, disk, etc. -- the OS will deschedule it from the core and schedule some other ready thread, so HT won't help.
HT tries to get more utilization out of the math execution units without actually duplicating much of the hardware in the core. If one thread has enough instruction-level parallelism and doesn't miss cache much, then it'll mostly fill up the core's resources and HT won't help. For heavy FP apps with data that doesn't fit in cache, HT still probably won't help much, since both threads are using the same execution units (SSE math) and both need more than the full cache -- in fact it's likely to hurt, since they'll be competing for cache and thrashing more. Of course it depends on exactly what you're doing and what your data access patterns look like.
HT mostly helps on branchy code with irregular and unpredictable access patterns. For FP-intensive code you can often do better with 1 thread per core and careful design of your access patterns (e.g. good data blocking).
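To make the data-blocking suggestion concrete, here is a minimal sketch of a blocked (tiled) loop nest. The matrix-multiply kernel, N, and the block size BS are illustrative assumptions; BS should be chosen so the working tiles fit in cache.

    // Minimal sketch of loop blocking: C += A * B on N x N doubles,
    // processed in BS x BS tiles so the working data stays cache-resident.
    // BS = 64 gives 32 KB tiles, sized here with the Westmere 256 KB L2
    // in mind; tune it for the actual cache hierarchy.
    const int N = 1024, BS = 64;

    void matmul_blocked(const double *A, const double *B, double *C) {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    // multiply the (ii,kk) tile of A into the (ii,jj) tile of C
                    for (int i = ii; i < ii + BS; ++i)
                        for (int k = kk; k < kk + BS; ++k) {
                            double a = A[i * N + k];
                            for (int j = jj; j < jj + BS; ++j)
                                C[i * N + j] += a * B[k * N + j];
                        }
    }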
I have developed a very high-performing, embarrassingly parallel code which will run on as many cores as are available. Initially it ran on a 2-core AMD laptop, but when I moved to a 2-core + HT Intel laptop the execution improvement was marginal: the presence of a generation-later CPU, two more (HT) logical cores, and a 670 MHz higher CPU clock could just not be noticed. When I restricted the code to two non-HT threads, the expected speed-up for the 2-core case was suddenly there and I could breathe easier.
When I changed the compiler optimization level from 3 to 2 and finally to 1, hyperthreading started showing its promise. The best results came at optimization level 1, approximately 50% better than the 2-core non-HT case.
What I think happens is that well-written, highly optimized code utilizes a core to the utmost, to the extent that there are basically no spare resources left for a second thread to execute on. The second thread will of course run, but the two threads collide whenever they need the same resource, and at a high optimization level they do so much more often.
With less optimized or less dense code, the threads had the opportunity to "interleave" their accesses to the core's resources to a larger degree. This resulted in two threads each running at around 75% of the rate at which the most highly optimized code would run on one core. Summed up, the less optimized code on two threads yielded 1.5 times the throughput of the most optimized code on one thread.
I have entertained the idea of writing code to see what level of core-resource "interleaving" might be achieved, hypothesizing that such a thread would spend half of each inner-loop iteration in one CPU execution pipe and half in the other. The expected result would be one thread executing half an inner loop behind the other, achieving the best resource "interleaving".
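A minimal sketch of that kind of experiment: pin two copies of a simple FP kernel either to the two hyperthreads of one core or to two separate cores, and compare wall times at different optimization levels. The CPU numbers and the kernel below are assumptions; real sibling pairs should be taken from the sysfs map shown earlier. Compile with g++ -O1/-O2/-O3 -pthread -lrt.

    // Sketch: time two copies of a dependent FP kernel pinned either to
    // the two hyperthreads of one core or to two separate cores.
    // The CPU numbers (0/12 as siblings, 0/1 as separate cores) are
    // assumptions; substitute the pairs reported by thread_siblings_list.
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>
    #include <ctime>

    static volatile double sink;             // keep the result live

    static void *kernel(void *arg) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(*(int *)arg, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        double x = 1.0;
        for (long i = 0; i < 400000000L; ++i)   // dependent FP chain
            x = x * 1.0000001 + 1e-9;
        sink = x;
        return 0;
    }

    static double run_pair(int a, int b) {
        pthread_t ta, tb;
        timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&ta, 0, kernel, &a);
        pthread_create(&tb, 0, kernel, &b);
        pthread_join(ta, 0);
        pthread_join(tb, 0);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    int main() {
        printf("sibling hyperthreads: %.2f s\n", run_pair(0, 12));
        printf("separate cores:       %.2f s\n", run_pair(0, 1));
        return 0;
    }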