Multi-core with Hyper-Threading: how are threads distributed?

Published 2024-07-10 08:22:08


I was reading a review of the new Intel Atom 330, where they noted that Task Manager shows 4 cores - two physical cores, plus two more simulated by Hyperthreading.

Suppose you have a program with two threads. Suppose also that these are the only threads doing any work on the PC; everything else is idle. What is the probability that the OS will put both threads on the same core? This has huge implications for program throughput.

If the answer is anything other than 0%, are there any mitigation strategies other than creating more threads?

I expect there will be different answers for Windows, Linux, and Mac OS X.


Using sk's answer as Google fodder, then following the links, I found the GetLogicalProcessorInformation function in Windows. It speaks of "logical processors that share resources. An example of this type of resource sharing would be hyperthreading scenarios." This implies that jalf is correct, but it's not quite a definitive answer.

Comments (8)

挽袖吟 2024-07-17 08:22:09


Linux has quite a sophisticated thread scheduler that is HT-aware. Some of its strategies include:

Passive load balancing: if a physical CPU is running more than one task, the scheduler will attempt to run any new tasks on a second physical processor.

Active load balancing: if there are 3 tasks, 2 on one physical CPU and 1 on the other, when the second physical processor goes idle the scheduler will attempt to migrate one of the tasks to it.

It does this while attempting to preserve thread affinity, because when a thread migrates to another physical processor it has to refill all levels of cache from main memory, causing a stall in the task.

So to answer your question (on Linux at least); given 2 threads on a dual core hyperthreaded machine, each thread will run on its own physical core.
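You can also observe and override this placement yourself from user space. A minimal sketch on Linux (an assumption: these `os` functions wrap the Linux-only `sched_getaffinity`/`sched_setaffinity` syscalls, so this will not run on Windows or macOS):

```python
import os

# The full set of logical CPUs this process is currently allowed to run on.
allowed = os.sched_getaffinity(0)   # 0 means "the calling process"
print("allowed:", sorted(allowed))

# Pin the process to a single logical CPU; the scheduler will then
# never migrate it anywhere else, HT sibling or not.
target = min(allowed)
os.sched_setaffinity(0, {target})
assert os.sched_getaffinity(0) == {target}

# Restore the original mask so the scheduler is free again.
os.sched_setaffinity(0, allowed)
```

Pinning like this trades the scheduler's load balancing for a guarantee: the thread keeps its cache-warm core, which is exactly the affinity the answer above describes.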

老娘不死你永远是小三 2024-07-17 08:22:09


A sane OS will try to schedule computationally intensive tasks on their own cores, but problems arise when you start context switching them. Modern OSes still tend to schedule things on cores where there is no work at scheduling time, but this can result in processes in parallel applications getting swapped from core to core fairly liberally. For parallel apps you do not want this, because you lose the data the process might have been using in the caches on its core. People use processor affinity to control this, but on Linux the semantics of sched_affinity() can vary a lot between distros/kernels/vendors, etc.

If you're on Linux, you can portably control processor affinity with the Portable Linux Processor Affinity Library (PLPA). This is what OpenMPI uses internally to make sure processes get scheduled to their own cores in multicore and multisocket systems; they've just spun off the module as a standalone project. OpenMPI is used at Los Alamos among a number of other places, so this is well-tested code. I'm not sure what the equivalent is under Windows.
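To pick one CPU per physical core you first need to know which logical CPUs are HT siblings. On Linux the kernel exposes this through sysfs in `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list` (an assumption: sysfs is mounted, which is standard). The file format is a comma-separated list that may contain ranges, e.g. `0,4` or `0-1`; a hypothetical parser sketch:

```python
def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,4' or '0-3,8' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def physical_cores(n_logical):
    """Group logical CPUs into physical cores by their sibling lists.

    Reads /sys/devices/system/cpu/cpuN/topology/thread_siblings_list,
    which is Linux-specific.
    """
    cores = set()
    for n in range(n_logical):
        path = f"/sys/devices/system/cpu/cpu{n}/topology/thread_siblings_list"
        with open(path) as f:
            cores.add(frozenset(parse_cpu_list(f.read())))
    return cores

print(parse_cpu_list("0,4"))   # CPUs 0 and 4 are HT siblings of one core
print(parse_cpu_list("0-1"))   # same idea with the adjacent numbering scheme
```

Choosing one CPU from each sibling set and pinning one thread to each gives you the "one thread per physical core" layout that PLPA automates.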

━╋う一瞬間旳綻放 2024-07-17 08:22:09


I have been looking for some answers on thread scheduling on Windows, and have some empirical information that I'll post here for anyone who may stumble across this post in the future.

I wrote a simple C# program that launches two threads. On my quad core Windows 7 box, I saw some surprising results.

When I did not force affinity, Windows spread the workload of the two threads across all four cores. There are two lines of code that are commented out - one that binds a thread to a CPU, and one that suggests an ideal CPU. The suggestion seemed to have no effect, but setting thread affinity did cause Windows to run each thread on its own core.

To see the results best, compile this code using the freely available compiler csc.exe that comes with the .NET Framework 4.0 client, and run it on a machine with multiple cores. With the processor affinity line commented out, Task Manager showed the threads spread across all four cores, each running at about 50%. With affinity set, the two threads maxed out two cores at 100%, with the other two cores idling (which is what I expected to see before I ran this test).

EDIT:
I initially found some differences in performance with these two configurations. However, I haven't been able to reproduce them, so I edited this post to reflect that. I still found the thread affinity interesting since it wasn't what I expected.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

class Program
{
    [DllImport("kernel32")]
    static extern int GetCurrentThreadId();

    static void Main(string[] args)
    {
        Task task1 = Task.Factory.StartNew(() => ThreadFunc(1));
        Task task2 = Task.Factory.StartNew(() => ThreadFunc(2));
        Stopwatch time = Stopwatch.StartNew();
        Task.WaitAll(task1, task2);
        Console.WriteLine(time.Elapsed);
    }

    static void ThreadFunc(int cpu)
    {
        int cur = GetCurrentThreadId();
        var me = Process.GetCurrentProcess().Threads.Cast<ProcessThread>().Where(t => t.Id == cur).Single();
        //me.ProcessorAffinity = (IntPtr)cpu;     //using this line of code binds a thread to each core
        //me.IdealProcessor = cpu;                //seems to have no effect

        //do some CPU / memory bound work
        List<int> ls = new List<int>();
        ls.Add(10);
        for (int j = 1; j != 30000; ++j)
        {
            ls.Add((int)ls.Average());
        }
    }
}
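One detail worth noting in the snippet above: `ProcessorAffinity` is a bitmask, not a CPU index. Passing the `cpu` values 1 and 2 directly therefore selects logical CPUs 0 and 1 (masks `0b01` and `0b10`), which on a hyperthreaded machine could even be two siblings of the same physical core, depending on how Windows numbers them. A quick sketch of the mask arithmetic (Python purely for illustration):

```python
def affinity_mask(logical_cpu):
    """Bitmask that permits exactly one logical CPU."""
    return 1 << logical_cpu

# The C# example passes 1 and 2 directly as masks:
assert affinity_mask(0) == 1   # Task 1 -> logical CPU 0
assert affinity_mask(1) == 2   # Task 2 -> logical CPU 1

# To target, say, logical CPUs 0 and 2 instead (which land on distinct
# physical cores IF siblings are numbered 0/1 and 2/3 - an assumption
# about the topology, not a guarantee), the masks would be:
print(affinity_mask(0), affinity_mask(2))   # 1 4
```

So the experiment worked because masks 1 and 2 happened to name two distinct logical CPUs, not because the values 1 and 2 meant "core 1" and "core 2".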
瑾夏年华 2024-07-17 08:22:09


The probability is essentially 0% that the OS won't utilize as many physical cores as possible. Your OS isn't stupid. Its job is to schedule everything, and it knows full well what cores it has available. If it sees two CPU-intensive threads, it will make sure they run on two physical cores.

Edit
Just to elaborate a bit, for high-performance stuff, once you get into MPI or other serious parallelization frameworks, you definitely want to control what runs on each core.

The OS will make a best-effort attempt to utilize all cores, but it doesn't have the long-term information that you do - that "this thread is going to run for a very long time", or "we're going to have this many threads executing in parallel". So it can't make perfect decisions, which means your thread will get assigned to a new core from time to time, which means you'll run into cache misses and the like, which costs a bit of time. For most purposes it's good enough, and you won't even notice the performance difference. It also plays nice with the rest of the system, if that matters. (On someone's desktop system, that's probably fairly important. In a grid with a few thousand CPUs dedicated to this task, you don't particularly want to play nice; you just want to use every clock cycle available.)

So for large-scale HPC stuff, yes, you'll want each thread to stay on one core, fixed. But for most smaller tasks, it won't really matter, and you can trust the OS's scheduler.

一片旧的回忆 2024-07-17 08:22:09


This is a very good and relevant question. As we all know, a hyper-threaded core is not a real CPU/core. Instead, it is a virtual CPU/core (from now on I'll say core). The Windows CPU scheduler as of Windows XP is supposed to be able to distinguish hyperthreaded (virtual) cores from real cores. You might imagine then that in this perfect world it handles them 'just right' and it is not an issue. You would be wrong.

Microsoft's own recommendation for optimizing a Windows 2008 BizTalk server recommends disabling HyperThreading. This suggests, to me, that the handling of hyper-threaded cores isn't perfect and sometimes threads get a time slice on a hyper-threaded core and suffer the penalty (a fraction of the performance of a real core, 10% I'd guess, and Microsoft guesses 20-30%).

Microsoft article reference where they suggest disabling HyperThreading to improve server efficiency: http://msdn.microsoft.com/en-us/library/cc615012(BTS.10).aspx

It is the SECOND recommendation, right after updating the BIOS; that is how important they consider it. They say:

FROM MICROSOFT:

"Disable hyper-threading on BizTalk Server and SQL Server computers

It is critical hyper-threading be turned off for BizTalk Server computers. This is a BIOS setting, typically found in the Processor settings of the BIOS setup. Hyper-threading makes the server appear to have more processors/processor cores than it actually does; however hyper-threaded processors typically provide between 20 and 30% of the performance of a physical processor/processor core. When BizTalk Server counts the number of processors to adjust its self-tuning algorithms; the hyper-threaded processors cause these adjustments to be skewed which is detrimental to overall performance."

Now, they do say it is due to it throwing off the self-tuning algorithms, but then go on to mention contention problems (suggesting it is a larger scheduling issue, at least to me). Read it as you will, but I think it says it all. HyperThreading was a good idea when we had single-CPU systems, but it is now just a complication that can hurt performance in this multi-core world.

Instead of completely disabling HyperThreading, you can use programs like Process Lasso (free) to set default CPU affinities for critical processes, so that their threads never get allocated to virtual CPUs.

So... I don't think anyone really knows just how well the Windows CPU scheduler handles virtual CPUs, but I think it is safe to say that XP handles it worst, and they've gradually improved it since then, but it still isn't perfect. In fact, it may NEVER be perfect, because the OS doesn't have any knowledge of which threads are best to put on these slower virtual cores. That may be the issue there, and why Microsoft recommends disabling HyperThreading in server environments.

Also remember even WITHOUT HyperThreading, there is the issue of 'core thrashing'. If you can keep a thread on a single core, that's a good thing, as it reduces the core change penalties.

-小熊_ 2024-07-17 08:22:09


You can make sure both threads get scheduled for the same execution units by giving them a processor affinity. This can be done in either Windows or UNIX, via either an API (so the program can ask for it) or via administrative interfaces (so an administrator can set it). E.g. in WinXP you can use the Task Manager to limit which logical processor(s) a process can execute on.

Otherwise, the scheduling will be essentially random and you can expect a 25% usage on each logical processor.
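Under that "essentially random" model, the probability the original question asks about has a closed form. Assuming logical CPUs {0,1} are siblings on one physical core and {2,3} on the other (an assumed numbering), two threads placed on two distinct logical processors end up sharing a physical core in 1 out of 3 cases:

```python
from itertools import combinations

# Assumed topology: siblings (0,1) on core A, siblings (2,3) on core B.
core_of = {0: "A", 1: "A", 2: "B", 3: "B"}

# All ways to place two threads on two distinct logical CPUs.
placements = list(combinations(core_of, 2))            # 6 placements
shared = [p for p in placements if core_of[p[0]] == core_of[p[1]]]

print(len(shared), "/", len(placements))   # 2 / 6, i.e. a 1-in-3 chance
```

So even under this worst-case "random scheduler" assumption, the two-thread program shares a core only a third of the time; an HT-aware scheduler (or explicit affinity) drives that toward zero.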

鸠魁 2024-07-17 08:22:09


I don't know about the other platforms, but in the case of Intel, they publish a lot of info on threading on their Intel Software Network. They also have a free newsletter (The Intel Software Dispatch) to which you can subscribe via email, and it has had a lot of such articles lately.

罗罗贝儿 2024-07-17 08:22:09


The chance that the OS will dispatch 2 active threads to the same core is zero unless the threads were tied to a specific core (thread affinity).

The reasons behind this are mostly HW-related:

  • The OS (and the CPU) wants to use as little power as possible, so it will run the tasks as efficiently as possible in order to enter a low-power state ASAP.
  • Running everything on the same core will cause it to heat up much faster. In pathological conditions, the processor may overheat and reduce its clock to cool down. Excessive heat also causes CPU fans to spin faster (think laptops) and create more noise.
  • The system is never actually idle. ISRs and DPCs run every ms (on most modern OSes).
  • Performance degradation due to threads hopping from core to core is negligible in 99.99% of workloads.
  • In all modern processors the last-level cache is shared, so switching cores isn't so bad.
  • For multi-socket (NUMA) systems, the OS will minimize hopping from socket to socket so that a process stays "near" its memory controller. This is a complex domain when optimizing for such systems (tens/hundreds of cores).

BTW, the way the OS knows the CPU topology is via ACPI - an interface provided by the BIOS.

To sum things up, it all boils down to system power considerations (battery life, power bill, noise from cooling solution).
