Aren't all processors created equal?
My laptop has 4 logical processors (two physical); logical CPUs 1 and 2 map to core 1, and logical CPUs 3 and 4 map to core 2 (verified with GetLogicalProcessorInformation()).
I ran a multithreaded matrix multiplication program on my computer with two threads. The first time, I used SetProcessAffinityMask(hProcess, 0x5) (which means logical processors 1 and 3), while the second time I used SetProcessAffinityMask(hProcess, 0xA) (logical processors 2 and 4).
It turned out that the first version was about twice as fast as the second version, as though I'd never multithreaded the second version at all.
Does anyone have any guesses as to why this might be happening?
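For reference, here is a minimal sketch of how the masks above translate into logical processor numbers. It assumes the 1-based numbering used in this post (bit 0 of the mask is "logical processor 1"); the helper name `logical_processors` is made up for illustration.

```python
def logical_processors(mask: int) -> list[int]:
    """Return the 1-based logical processor numbers selected by an affinity mask.

    Assumption: bit 0 corresponds to "logical processor 1", matching the
    numbering used in the post.
    """
    if mask < 0:
        raise ValueError("affinity mask must be non-negative")
    return [bit + 1 for bit in range(mask.bit_length()) if (mask >> bit) & 1]


for mask in (0x5, 0xA):
    print(f"mask {mask:04b}b selects logical processors {logical_processors(mask)}")
```

With this numbering, 0x5 selects processors 1 and 3, and 0xA selects processors 2 and 4, as described above.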
Measurements:
Plugged in (full CPU):
- Affinity mask: 0x3 (0011b), 9 gflop/s
- Affinity mask: 0x5 (0101b), 17 gflop/s
- Affinity mask: 0x6 (0110b), 17 gflop/s
- Affinity mask: 0x9 (1001b), 9 gflop/s
- Affinity mask: 0xA (1010b), 9 gflop/s
- Affinity mask: 0xC (1100b), 9 gflop/s
On battery (clocked down):
- Affinity mask: 0x3 (0011b), 5 gflop/s
- Affinity mask: 0x5 (0101b), 10 gflop/s
- Affinity mask: 0x6 (0110b), 10 gflop/s
- Affinity mask: 0x9 (1001b), 5 gflop/s
- Affinity mask: 0xA (1010b), 2 gflop/s (--> Very interesting, why half speed when on battery but normal speed on AC?! This one varies a lot between 1.5-2.5 gflop/s, unlike the others.)
- Affinity mask: 0xC (1100b), 5 gflop/s
Does this imply that the fourth logical CPU is not doing anything (!)? (Everything with the mask for the fourth CPU set is slow.)
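One rough way to check that observation is to compare the plugged-in numbers against what the reported topology would predict. This is only a sketch: the 1.5x cutoff for calling a run "fast" and all helper names are assumptions made for illustration, not part of the original measurements.

```python
# Plugged-in throughput from the list above, in gflop/s, keyed by affinity mask.
PLUGGED_IN = {0x3: 9, 0x5: 17, 0x6: 17, 0x9: 9, 0xA: 9, 0xC: 9}

# Topology as reported by GetLogicalProcessorInformation(), per the post:
# logical CPUs 1 and 2 on core 1, logical CPUs 3 and 4 on core 2.
REPORTED_CORE = {1: 1, 2: 1, 3: 2, 4: 2}


def cpus(mask):
    """1-based logical CPU numbers selected by a 4-bit mask."""
    return [bit + 1 for bit in range(4) if (mask >> bit) & 1]


def is_fast(gflops, baseline=9, threshold=1.5):
    """Assumed cutoff: a run is 'fast' if it reaches 1.5x the slow rate."""
    return gflops >= threshold * baseline


def spans_two_cores(mask, core_of=REPORTED_CORE):
    """True if the mask selects logical CPUs on two different physical cores."""
    return len({core_of[c] for c in cpus(mask)}) == 2


# Masks where the reported topology predicts a 2x speedup that was not observed
# (or the reverse).
mismatches = [m for m in PLUGGED_IN if spans_two_cores(m) != is_fast(PLUGGED_IN[m])]

# Every mask that includes logical CPU 4.
masks_with_cpu4 = [m for m in PLUGGED_IN if 4 in cpus(m)]

print("mismatches vs. reported topology:", [hex(m) for m in mismatches])
print("masks including CPU 4:", [hex(m) for m in masks_with_cpu4])
```

Under these assumptions the only masks that contradict the reported topology are 0x9 and 0xA, and both include logical CPU 4, which is consistent with the suspicion above but does not explain it.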
Update:
I just ran the same thing on the High Performance profile on batteries. The results are inconsistent: This time, I got 2x speedup for the masks 5, 6, and 10, but there was no speedup for mask 12. I'll try to run the tests again on AC power, but ultimately it seems like this result is a combination of power management, Turbo Boost, scheduling inconsistencies, etc., and it's more difficult to measure than I previously thought. :(
Comments (4)
SetProcessAffinityMask() does not guarantee you will have one thread per core; only that the threads you have will run on the cores you have allowed.
Perhaps the OS is scheduling differently.
Also, I'm surprised 1 and 2 are on core 1. Usually, logical processor numbers interleave over physical cores, to provide an inherent load balancing. I would expect 1 and 3 to be on core 1, 2 and 4 to be on core 2.
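If the goal is exactly one thread per allowed logical processor, one option is to pin each thread individually rather than rely on the process-wide mask. A minimal sketch of the mask arithmetic follows; `per_thread_masks` is a hypothetical helper, and each resulting value would then be handed to the Win32 SetThreadAffinityMask call for one worker thread (that call is not made here).

```python
def per_thread_masks(process_mask: int) -> list[int]:
    """Split a process affinity mask into one single-bit mask per allowed CPU."""
    masks = []
    remaining = process_mask
    while remaining:
        lowest = remaining & -remaining  # isolate the lowest set bit
        masks.append(lowest)
        remaining ^= lowest              # clear that bit and continue
    return masks


# For a process mask of 0x5, one thread would get 0x1 and the other 0x4.
print([hex(m) for m in per_thread_masks(0x5)])
```

This removes one variable from the experiment: the scheduler can no longer place both threads on the same logical processor or move them between the two allowed ones.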
No, not all cores are equal. Only one is the boot core. Furthermore, in many cases all IRQs (or at least IRQs from a majority of the devices) are directed to a single core.
More important to your observed behavior, not all sets of cores are equal. In a NUMA memory architecture (which has been relatively mainstream in x86 since Intel Hyperthreading and AMD Opteron), there's an ideal group of processors that can efficiently access a particular region of memory, and all other processors will pay a significant penalty to access that range.
With Hyperthreading, it's not main system memory that's connected non-uniformly, but L1 and L2 cache. If your process migrates between the two virtual processors associated to the same physical core, the cache remains valid. But if it migrates to the other physical core, cached data has to be copied and ownership transferred to the other cache. For some workloads, this could make a big difference.
It would be good to know what physical CPU this is, but I'm assuming from your phrasing about logical processors that there is 1 physical socket, 2 CPU cores, and hyperthreading is enabled giving you 4 logical processors.
The short answer is, for this complicated definition of "processor", no, not all processors are created equal. Hyperthreaded logical cores share execution resources, and if there's contention for those resources they won't be as fast as separate physical cores. This sharing can take place at different levels for both hyperthreading and multicore processors (ALU, execution resources, cache at different levels, etc.), but in broad terms, physical cores in the same socket won't be affected much by what the other core(s) is/are doing, and logical cores implemented by hyperthreading will be hugely affected by what their hypertwin is doing.
Another difference between different CPUs: As Ben said, your OS may process most hardware interrupts on a single CPU, which means that CPU will seem slower for other purposes, but I'd be surprised if the interrupt load is enough to impact performance anywhere near this much.
The results you got -- on processors A and B (being intentionally ambiguous about which 2 processors those are) you get double the performance of A alone, but on processors A and C you get approximately the same performance as A alone -- sure sounds like hyperthreading is the difference, where A and C are hypertwins in the same physical core, and B is in the other physical core. You said that GetLogicalProcessorInformation() claims otherwise, but it's not unheard of for the BIOS tables on which that depends to have errors.
I would run Task Manager, keep an eye on loads on each CPU before you run your test to get an idea of how much else is going on and where Windows schedules it, then run your test again a few times, for different combinations of CPU affinity, and see if you can confirm or deny this theory.
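A small harness for that kind of repeated test might look like the sketch below. It enumerates every two-CPU affinity mask on a 4-logical-processor machine and takes the median over several runs to damp scheduling and power-management noise. The `measure` callable is hypothetical: it stands for whatever pins the process to the given mask, runs the benchmark, and returns gflop/s.

```python
from itertools import combinations
from statistics import median


def two_cpu_masks(cpu_count=4):
    """All affinity masks with exactly two bits set, in ascending order."""
    return sorted((1 << a) | (1 << b) for a, b in combinations(range(cpu_count), 2))


def median_throughput(measure, mask, repeats=5):
    """Median of several runs of measure(mask).

    `measure` is a hypothetical callable supplied by the caller; it should set
    the affinity mask, run the benchmark once, and return gflop/s.
    """
    return median(measure(mask) for _ in range(repeats))


print([hex(m) for m in two_cpu_masks()])
```

For four logical processors this yields the same six masks that appear in the question's measurements: 0x3, 0x5, 0x6, 0x9, 0xA, and 0xC.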
Have you checked the return code from SetProcessAffinityMask to see if there was an error? If the call fails, you might get stuck on one logical processor. According to the documentation, you can only use the bits that are set in the result of GetProcessAffinityMask.
You say you've tried masks of 0x5, 0xA, and 0x9. I'd be curious to see the results with 0x3.
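The check described above could be sketched roughly as follows, here through Python's ctypes rather than C. The helper names `mask_is_allowed` and `set_affinity_checked` are made up for illustration, and the Windows-only part is a sketch that has not been validated against the asker's program; only the subset test runs on other platforms.

```python
import ctypes
import sys


def mask_is_allowed(requested: int, system_mask: int) -> bool:
    """True if the requested mask is non-zero and uses only bits in system_mask."""
    return requested != 0 and (requested & ~system_mask) == 0


def set_affinity_checked(requested: int) -> None:
    """Windows-only sketch: validate the mask, then set it and check the result."""
    if sys.platform != "win32":
        raise OSError("SetProcessAffinityMask is only available on Windows")
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentProcess.restype = wintypes.HANDLE
    kernel32.GetProcessAffinityMask.argtypes = [
        wintypes.HANDLE,
        ctypes.POINTER(ctypes.c_size_t),  # process affinity mask (DWORD_PTR)
        ctypes.POINTER(ctypes.c_size_t),  # system affinity mask (DWORD_PTR)
    ]
    kernel32.GetProcessAffinityMask.restype = wintypes.BOOL
    kernel32.SetProcessAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
    kernel32.SetProcessAffinityMask.restype = wintypes.BOOL

    process = kernel32.GetCurrentProcess()
    process_mask = ctypes.c_size_t()
    system_mask = ctypes.c_size_t()
    if not kernel32.GetProcessAffinityMask(
        process, ctypes.byref(process_mask), ctypes.byref(system_mask)
    ):
        raise ctypes.WinError(ctypes.get_last_error())

    # Reject masks that name processors the system does not expose.
    if not mask_is_allowed(requested, system_mask.value):
        raise ValueError(
            f"mask {requested:#x} is not a subset of system mask {system_mask.value:#x}"
        )

    # A zero return value means the call failed and the old affinity is still in effect.
    if not kernel32.SetProcessAffinityMask(process, requested):
        raise ctypes.WinError(ctypes.get_last_error())
```

If the call fails silently in the benchmark, the process keeps its previous affinity, so a failed call and a slow mask would be indistinguishable from the throughput numbers alone.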