Dual-core hyperthreading: should I use 4 threads, or 3, or 2?
If you're spawning multiple threads (or processes) concurrently, is it better to spawn as many as the number of physical processors or the number of logical processors, assuming the task is CPU-bound? Or is it better to do something in between (say, 3 threads)?
Does the performance depend on the kind of instructions that are getting executed (say, would non-local memory access be much different from cache hits)? If so, in which cases is it better to take advantage of hyperthreading?
Update:
The reason I'm asking is that I remember reading somewhere that if you have as many tasks as virtual processors, tasks on the same physical core can sometimes starve each other of CPU resources, so that neither gets as much as it needs, possibly decreasing performance. That's why I'm wondering whether having as many threads as virtual cores is a good idea.
The performance depends on a huge variety of factors. Most tasks are not strictly CPU-bound, since even if all of the data is in memory, it is usually not resident in the processor cache. I have seen examples (like this one) where memory access patterns can dramatically change the performance profile of a given 'parallel' process.
In short, there is no perfect number for all situations.
Chances are pretty good that you will see a performance improvement running 2 threads per core with HyperThreading enabled. Jobs that appear to be entirely CPU bound usually aren't, and HyperThreading can extract a few "extra" cycles out of the occasional interrupt or context switch.
On the other hand, with a Core iX processor that has Turbo Boost, you might actually do better running 1 thread per core to encourage the CPU to overclock itself.
At work, we routinely run many-core servers at full CPU load, doing various kinds of calculation for days at a time. A while back we measured the performance difference with and without HT. We found that, on average, with HyperThreading enabled and twice as many jobs running at once, we could complete the same number of jobs about 10% faster than without HyperThreading.
Assume that 2 × cores is a good place to start, but the bottom line is: measure!
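One way to act on "measure" is a small timing harness that runs the same CPU-bound work with different worker counts. The following is a minimal sketch, not a rigorous benchmark. The names `burn`, `candidate_counts`, and `time_run` are hypothetical, the workload is a stand-in for your real task, and the sketch assumes two logical processors per physical core (`os.cpu_count()` only reports logical processors). It uses processes rather than threads because CPU-bound Python threads are serialized by the interpreter lock.

```python
import os
import time
from multiprocessing import Pool


def burn(n):
    """Stand-in CPU-bound task: sum of squares below n (replace with real work)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def candidate_counts(logical, threads_per_core=2):
    """Worker counts to try: one per physical core, one per logical core, and the midpoint."""
    physical = max(1, logical // threads_per_core)  # assumption: uniform SMT width
    return sorted({physical, (physical + logical) // 2, logical})


def time_run(workers, tasks, size):
    """Wall-clock seconds to finish `tasks` calls of burn(size) using `workers` processes."""
    start = time.perf_counter()
    with Pool(processes=workers) as pool:
        pool.map(burn, [size] * tasks)
    return time.perf_counter() - start


if __name__ == "__main__":
    logical = os.cpu_count() or 1
    for workers in candidate_counts(logical):
        elapsed = time_run(workers, tasks=4 * logical, size=500_000)
        print(f"{workers} workers: {elapsed:.2f} s")
```

On a dual-core machine with hyperthreading this tries 2, 3, and 4 workers. Run it several times and on an otherwise idle machine before drawing conclusions, since pool start-up and background load add noise.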
I remember reading that hyperthreading can give you up to a 30% performance boost. In general, you'd do better to treat them as 4 different cores. Of course, in some specific circumstances (e.g. having the same long-running task bound to each core) you can divide your processing better by taking into account that some cores are just logical ones.
More info about hyperthreading itself here.
Using Hyperthreading to run two threads on the same core, when both threads have similar memory access patterns but access disjoint data structures, would be very roughly equivalent to running them on two separate cores each with half the cache. If the memory-access patterns are such that half the cache would be sufficient to prevent thrashing, performance may be good. If the memory-access patterns are such that halving the cache induces thrashing, there may be a ten-fold performance hit (implying one would have been much better off without hyperthreading).
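The "half the cache" rule of thumb can be turned into a quick back-of-the-envelope check. The sketch below is a deliberately crude model under stated assumptions: the cache is split evenly between the hardware threads on a core, and the function name and the example sizes (a 256 KiB per-core cache, 100 KiB and 200 KiB working sets) are hypothetical values chosen for illustration, not measurements.

```python
def fits_when_sharing(working_set_bytes, core_cache_bytes, threads_per_core=2):
    """True if each thread's working set fits in its even share of the core's cache."""
    share = core_cache_bytes // threads_per_core  # assumption: even split
    return working_set_bytes <= share


KIB = 1024
cache = 256 * KIB  # hypothetical per-core cache size

print(fits_when_sharing(100 * KIB, cache))  # 100 KiB fits in a 128 KiB share
print(fits_when_sharing(200 * KIB, cache))  # 200 KiB exceeds a 128 KiB share
```

Real caches do not partition this cleanly, so a result near the boundary says little; profiling the actual workload is still the deciding step.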
On the other hand, there are some situations where hyperthreading may be a huge win. If many threads will all be reading and writing the same shared data using lock-free data structures, and all threads must see a consistent view of the data, trying to run the threads on disjoint processors may cause thrashing, since only one processor at a time may have read-write access to any given cache line; running such threads on two cores may take longer than running only one at a time. Such cache arbitration is not required, however, when a piece of data is accessed by multiple threads on a single core. In those cases, hyperthreading can be a huge win.
Unfortunately, I don't know any way to give the scheduler any "hints" to suggest that some threads should share a core when possible, while others should run separately when possible.
HT allows a boost of approximately 10-30% for mostly CPU-bound tasks that use the extra virtual cores. Although these tasks may seem CPU-bound, unless they are hand-written assembly they will usually suffer from IO waits (memory stalls) between RAM and the local cache. This lets one thread running on a physical HT-enabled core do work while the other thread is waiting for IO. This does come with a disadvantage, though: the two threads share the same cache and bus, so each gets fewer resources, which may cause both threads to pause while waiting for IO.
In that last case, running a single thread per core gives up the maximum theoretical simultaneous processing power (the 10-30% gain) in exchange for avoiding the slowdown of cache thrashing, which may be very significant in some applications.
Choosing which cores to use is just as important as choosing how many threads to run. If each thread is CPU-bound for roughly the same duration, it is best to set the affinity so that threads using mostly different resources land on different physical cores, while threads using common resources are grouped on the same physical core (on different virtual cores), so that the common resources can be served from the same cache without extra IO wait.
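As an illustration of setting affinity by physical core, here is a minimal Linux-only sketch. It assumes the kernel exposes the sysfs file `thread_siblings_list` for each CPU, and uses Python's `os.sched_setaffinity`, which restricts a whole process rather than an individual thread. The helper names `parse_cpu_list`, `physical_core_groups`, and `pin_current_process` are hypothetical.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0,4' or '0-1,4-5' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def physical_core_groups():
    """Group logical CPU ids by physical core using Linux sysfs topology files."""
    groups = set()
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        with open(path) as f:
            groups.add(tuple(parse_cpu_list(f.read())))
    return sorted(groups)


def pin_current_process(cpus):
    """Restrict the calling process to the given logical CPUs (Linux only)."""
    os.sched_setaffinity(0, set(cpus))


if __name__ == "__main__":
    groups = physical_core_groups()
    print("logical CPUs per physical core:", groups)
    if groups:
        spread = {g[0] for g in groups}    # one logical CPU per physical core
        allowed = os.sched_getaffinity(0)  # CPUs this process may already use
        pin_current_process((spread & allowed) or allowed)
        print("now restricted to:", sorted(os.sched_getaffinity(0)))
```

Picking one logical CPU from each group keeps workers on separate physical cores; passing a whole group instead confines workers to the sibling hardware threads of one core, which is the arrangement suggested for threads that share data.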
Since each program has different CPU-usage characteristics, and cache thrashing may or may not be a major slowdown (it usually is), it is impossible to determine the ideal number of threads without profiling first. One last thing to note is that the OS/kernel will also require some CPU time and cache space. If CPU-bound threads need real-time latency, it is usually best to keep a single (physical) core set aside for the OS, so that those threads do not share cache and CPU resources with it. If threads are often waiting for IO and cache thrashing is not an issue, or if you are running a real-time OS specifically designed for the application, you can skip this last step.
http://en.wikipedia.org/wiki/Thrashing_(computer_science)
http://en.wikipedia.org/wiki/Processor_affinity
All of the other answers already give lots of excellent info. But one more point to consider is that the SIMD unit is shared between the logical cores of the same physical core. So, if you are running threads with SSE code, do you run them on all 4 logical cores, or just spawn 2 threads (assuming you have two physical cores)? For this odd case, it is best to profile with your app.