Linux per-process resource limits - a deep Red Hat mystery
I have my own multithreaded C program which scales in speed smoothly with the number of CPU cores. I can run it with 1, 2, 3, etc. threads and get linear speedup, up to about 5.5x speed on a 6-core CPU on an Ubuntu Linux box.
I had an opportunity to run the program on a very high end Sunfire x4450 with 4 quad-core Xeon processors, running Red Hat Enterprise Linux. I was eagerly anticipating seeing how fast the 16 cores could run my program with 16 threads.
But it runs at the same speed as just TWO threads!
Much hair-pulling and debugging later, I see that my program really is creating all the threads, they really are running simultaneously, but the threads themselves are slower than they should be. 2 threads runs about 1.7x faster than 1, but 3, 4, 8, 10, 16 threads all run at just net 1.9x! I can see all the threads are running (not stalled or sleeping), they're just slow.
To check that the HARDWARE wasn't at fault, I ran SIXTEEN copies of my program independently, simultaneously. They all ran at full speed. There really are 16 cores and they really do run at full speed and there really is enough RAM (in fact this machine has 64GB, and I only use 1GB per process).
So, my question is if there's some OPERATING SYSTEM explanation, perhaps some per-process resource limit which automatically scales back thread scheduling to keep one process from hogging the machine.
Clues are:
- My program does not access the disk or network. It's CPU limited. Its speed scales linearly on a single-CPU box in Ubuntu Linux with a hexacore i7 for 1-6 threads. 6 threads is effectively 6x speedup.
- My program never runs faster than 2x speedup on this 16-core Sunfire Xeon box, for any number of threads from 2-16.
- Running 16 copies of my program single threaded runs perfectly, all 16 running at once at full speed.
- top shows 1600% of CPUs allocated. /proc/cpuinfo shows all 16 cores running at full 2.9GHz speed (not the low-frequency idle speed of 1.6GHz).
- There's 48GB of RAM free; it is not swapping.
What's happening? Is there some process CPU limit policy? How could I measure it if so?
What else could explain this behavior?
Thanks for your ideas to solve this, the Great Xeon Slowdown Mystery of 2010!
3 Answers
My initial guess would be shared memory bottlenecks. From what you say, your performance pretty much flatlines after 2 CPUs. You initially blame Redhat, but I'd be curious to see what happens if you install Ubuntu on the same hardware. I assume, of course, that you're running 64 bit SMP kernels across both tests.
It's probably not possible that the motherboard would peak at utilizing 2 CPUs. You have another machine with multiple cores that has provided better performance. Do you have hyperthreading turned on with the new machine? (and how does that answer compare to the old machine?). You're not, by chance, running in a virtualized environment?
Overall, your evidence is pointing to a ludicrously slow bottleneck somewhere. As you said, you're not I/O bound, so that leaves the CPU and memory. Either something is wrong with the hardware, or something is wrong with the software. Test one by changing the other, and you'll narrow down your possibilities quickly.
Do some research on rlimit - it's quite possible the shell/user acct you're running in has some RH-default or admin-set resource limits in place.
When you see this kind of odd scaling behaviour, especially if problems are seen with multiple threads, but not multiple processes, one thing to start looking at is the impacts of lock contention and other synchronisation primitives, which can cause threads running on different processors to have to wait for each other, potentially forcing multiple cores to flush their cache to main memory.
This means memory architecture starts to come into play, and that's going to be substantially faster when you have 6 cores on a single piece of silicon than when you're coordinating across 4 separate processors. Specifically, the single CPU case likely isn't needing to hit main memory for locking operations at all - everything is likely being handled at the L3 cache level, allowing the CPU to get on with things while data is flushed to main memory in the background.
While I expect the OP has lost interest in the question after all this time (or may not even have access to the hardware any more), one way to check this would be to see if the scaling up to 4 threads improves when the process affinity is set to lock it to a single physical CPU. Even better would be to profile the application itself to see where it is spending its time. As you change architectures and increase the number of cores, it gets harder and harder to guess where the bottlenecks are, so you really need to start measuring things directly, as in this example: http://postgresql.1045698.n5.nabble.com/Sun-Donated-a-Sun-Fire-T2000-to-the-PostgreSQL-community-td2057445.html