How are applications scheduled on multi-core machines with Hyper-Threading enabled?
I'm trying to gain a better understanding of how hyper-threading enabled multi-core processors work. Let's say I have an app which can be compiled with MPI or OpenMP or MPI+OpenMP. I wonder how it will be scheduled on a CentOS 5.3 box with four Xeon X7560 @ 2.27GHz processors, each with Hyper-Threading enabled on every core.
The processors are numbered 0 to 63 in /proc/cpuinfo. As I understand it, there are four 8-core physical processors, for a total of 32 PHYSICAL cores; with Hyper-Threading enabled on every core, there are 64 LOGICAL processors in total.
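One way to verify that mapping is to count the distinct package and core IDs; a sketch, assuming the usual Linux /proc/cpuinfo layout where each entry lists a physical id line followed later by a core id line:

    # Number of distinct physical packages (expect 4)
    grep "physical id" /proc/cpuinfo | sort -u | wc -l
    # Number of distinct (package, core) pairs, i.e. physical cores (expect 32)
    grep -E "^(physical id|core id)" /proc/cpuinfo | paste - - | sort -u | wc -l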
Compiled with MPICH2
How many physical cores will be used if I run with mpirun -np 16? Does it get divided up amongst the available 16 PHYSICAL cores, or 16 LOGICAL processors (8 PHYSICAL cores using hyper-threading)?
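For reference, MPICH2 does not bind ranks to cores by default, so placement is left to the Linux scheduler. One way to constrain the whole job from outside, sketched under the assumptions that everything runs on a single machine, that ./app is the binary, and that logical CPUs 0-15 really sit on distinct physical cores (worth verifying first):

    # Child processes of mpirun inherit the affinity mask set by taskset
    taskset -c 0-15 mpirun -np 16 ./app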
Compiled with OpenMP
How many physical cores will be used if I set OMP_NUM_THREADS=16? Will it use 16 LOGICAL processors?
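As a side note, with GCC's libgomp the placement can be pinned instead of left to the scheduler via GOMP_CPU_AFFINITY (Intel's runtime uses KMP_AFFINITY instead); a sketch, with ./app as a placeholder binary:

    # Pin the 16 OpenMP threads to logical CPUs 0-15 (GCC/libgomp)
    export OMP_NUM_THREADS=16
    export GOMP_CPU_AFFINITY="0-15"
    ./app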
Compiled with MPICH2+OpenMP
How many physical cores will be used if I set OMP_NUM_THREADS=16 and run with mpirun -np 16?
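Note that 16 ranks x 16 threads = 256 threads, which oversubscribes the 64 logical processors four times over, so the scheduler will have to time-slice them. A sketch of a split that exactly matches the hardware (./app is a placeholder):

    # 16 ranks x 4 threads = 64 threads, one per logical processor
    export OMP_NUM_THREADS=4
    mpirun -np 16 ./app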
Compiled with OpenMPI
OpenMPI has two runtime options:
-cpu-set, which specifies the logical CPUs allocated to the job,
-cpus-per-proc, which specifies the number of CPUs to use for each process.
If run with mpirun -np 16 -cpu-set 0-15, will it only use 8 PHYSICAL cores?
If run with mpirun -np 16 -cpu-set 0-31 -cpus-per-proc 2, how will it be scheduled?
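Since the answer depends on how the kernel numbered the hyperthread siblings (0-15 could be 16 distinct cores, or 8 cores x 2 threads each), it may be easier to have Open MPI print its placement than to guess; a sketch, assuming a build recent enough to support the flag:

    # Print which logical CPUs each rank was bound to
    mpirun -np 16 -cpu-set 0-15 --report-bindings ./app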
Thanks
Jerry
3 Answers
I'd expect any sensible scheduler to prefer running threads on different physical processors if possible. Then I'd expect it to prefer different physical cores. Finally, if it must, it would start using the hyperthreaded second thread on each physical core.
Basically, when threads have to share processor resources they slow down, so the optimal strategy is usually to minimise the amount of processor-resource sharing. This is the right strategy for CPU-bound processes, and that's normally what an OS assumes it is dealing with.
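If you'd rather observe than predict, thread placement can be sampled while the job runs; a sketch, with app as a placeholder process name:

    # psr = the logical CPU each thread (lwp) last ran on
    ps -eLo pid,lwp,psr,comm | grep app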
I would hazard a guess that the scheduler will try to keep the threads of one process on the same physical cores. So if you had sixteen threads, they would be on the smallest number of physical cores. The reason for this would be cache locality; threads from the same process are more likely to touch the same memory than threads from different processes. (For example, the cost of cache-line invalidation across cores is high, but that cost is not incurred between logical processors in the same core.)
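Which logical CPU numbers are siblings on the same physical core can be read from sysfs on kernels that expose the topology directory (older kernels may only provide the thread_siblings bitmask); a quick check:

    # Logical CPUs sharing a physical core with CPU 0
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list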
As you can see from the other two answers the ideal scheduling policy varies depending on what activity the threads are doing.
Threads working on completely different data benefit from more separation. These threads would ideally be scheduled in separate NUMA domains and physical cores.
Threads working on the same data will benefit from cache locality, so the ideal policy is to schedule them close together so that they share cache.
Threads that work on the same data and experience a large amount of pipeline stalls benefit from sharing a hyperthread core. Each thread can run until it stalls, at which point the other thread can run. Threads that run without stalls are only hurt by hyperthreading and should be run on different cores.
Making the ideal scheduling decision relies on a lot of data collection and a lot of decision making. A large danger in OS design is to make the thread scheduling too smart. If the OS spends a lot of processor time trying to find the ideal place to run a thread, it's wasting time it could be using to run the thread.
So it's often more efficient to use a simplified thread scheduler and, if needed, let the program specify its own policy. This is what thread affinity settings are for.
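On Linux the usual external knobs for such a policy are taskset (plain CPU affinity) and numactl (NUMA placement). A sketch of the two extremes described above, with ./app and ./worker as placeholder binaries and CPUs 0 and 32 assumed (not guaranteed) to be hyperthread siblings:

    # Shared data, stall-heavy threads: pack them onto one core's hyperthread pair
    taskset -c 0,32 ./app
    # Independent data: one worker per NUMA node, with memory kept local
    numactl --cpunodebind=0 --membind=0 ./worker &
    numactl --cpunodebind=1 --membind=1 ./worker &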