Linux thread scheduling differences on multi-core systems?

Posted on 2024-11-09 03:24:21

We have several latency-sensitive "pipeline"-style programs that show a measurable performance degradation when run on one Linux kernel versus another. In particular, we see better performance with the 2.6.9 kernel from CentOS 4.x (RHEL4), and worse performance with the 2.6.18 kernel from CentOS 5.x (RHEL5).

By "pipeline" program, I mean one that has multiple threads. The mutiple threads work on shared data. Between each thread, there is a queue. So thread A gets data, pushes into Qab, thread B pulls from Qab, does some processing, then pushes into Qbc, thread C pulls from Qbc, etc. The initial data is from the network (generated by a 3rd party).

We basically measure the time from when the data is received to when the last thread performs its task. In our application, we see an increase of anywhere from 20 to 50 microseconds when moving from CentOS 4 to CentOS 5.
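
One way to take such a measurement, assuming each item is timestamped by the receiving thread (item_t and its fields are hypothetical names):

    /* Hypothetical end-to-end latency measurement: stamp each item as it
     * comes off the network, read the clock again in the final stage.
     * On older glibc (RHEL4/5 era), link with -lrt for clock_gettime. */
    #include <time.h>

    typedef struct {
        struct timespec recv_ts;   /* stamped by the first (network) thread */
        /* ... payload ... */
    } item_t;

    void stamp_receive(item_t *it) {
        clock_gettime(CLOCK_MONOTONIC, &it->recv_ts);
    }

    /* Called by the last thread in the pipeline when its task completes. */
    long long latency_us(const item_t *it) {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        long long ns = (long long)(now.tv_sec - it->recv_ts.tv_sec) * 1000000000LL
                     + (now.tv_nsec - it->recv_ts.tv_nsec);
        return ns / 1000;          /* microseconds */
    }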

I have used a few methods of profiling our application, and determined that the added latency on CentOS 5 comes from queue operations (in particular, popping).

However, I can improve performance on CentOS 5 (to be the same as CentOS 4) by using taskset to bind the program to a subset of the available cores.

So it appears to me that, between CentOS 4 and 5, there was some change (presumably to the kernel) that caused threads to be scheduled differently (and this difference is suboptimal for our application).

While I can "solve" this problem with taskset (or in code via sched_setaffinity()), my preference is to not have to do this. I'm hoping there's some kind of kernel tunable (or maybe collection of tunables) whose default was changed between versions.
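
For reference, the taskset workaround is along the lines of taskset -c 0-3 ./app; a sketch of the in-code equivalent via sched_setaffinity() follows (the core numbers are purely illustrative):

    /* Sketch: pin the whole process to cores 0-3, the in-code equivalent
     * of `taskset -c 0-3 ./app`. Which cores to use is workload-specific. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int pin_to_cores(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = 0; cpu <= 3; cpu++)
            CPU_SET(cpu, &set);
        /* pid 0 means the calling process; to pin individual pipeline
         * threads instead, use pthread_setaffinity_np() per thread. */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }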

Anyone have any experience with this? Perhaps some more areas to investigate?

Update: In this particular case, the issue was resolved by a BIOS update from the server vendor (Dell). I pulled my hair out for quite a while on this one, until I went back to basics and checked the vendor's BIOS updates. Suspiciously, one of the updates said something like "improve performance in maximum performance mode". Once I upgraded the BIOS, CentOS 5 was faster: generally speaking, but particularly in my queue tests and actual production runs.

Comments (2)

红尘作伴 2024-11-16 03:24:22

Hmm... if the time taken for a pop() operation from a producer-consumer queue is making a significant difference to the overall performance of your app, I would suggest that the structure of your threads/workflow is not optimal somewhere. Unless there is a huge amount of contention on the queues, I would be surprised if any P-C queue push/pop on any modern OS took more than a µs or so, even if the queue uses kernel locks in the classic 'Computer Science 117 - how to make a bounded P-C queue with three semaphores' manner.
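
For reference, the classic three-semaphore bounded queue alluded to here looks roughly like this sketch (one counting semaphore for free slots, one for occupied slots, and a binary semaphore as the buffer lock; compile with -pthread):

    #include <semaphore.h>

    #define CAP 1024

    static void *buf[CAP];
    static int   head, tail;
    static sem_t empty_slots, full_slots, mutex;

    void pcq_init(void) {
        sem_init(&empty_slots, 0, CAP);  /* CAP free slots initially */
        sem_init(&full_slots,  0, 0);    /* nothing to consume yet   */
        sem_init(&mutex,       0, 1);    /* binary semaphore as lock */
    }

    void pcq_push(void *item) {
        sem_wait(&empty_slots);          /* block while queue is full  */
        sem_wait(&mutex);
        buf[tail] = item;
        tail = (tail + 1) % CAP;
        sem_post(&mutex);
        sem_post(&full_slots);           /* wake a waiting consumer    */
    }

    void *pcq_pop(void) {
        sem_wait(&full_slots);           /* block while queue is empty */
        sem_wait(&mutex);
        void *item = buf[head];
        head = (head + 1) % CAP;
        sem_post(&mutex);
        sem_post(&empty_slots);          /* free a slot for producers  */
        return item;
    }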

Can you absorb the functionality of the thread(s) that do the least work into those that do the most, thereby reducing the number of push/pop operations per overall work item flowing through your system?

拒绝两难 2024-11-16 03:24:22

The Linux scheduler has been an area of intense change and contention over the years. You might want to try a very recent kernel and give that a go. Yes, you may have to compile it yourself; it will be good for you. You might also (once you have a newer kernel) consider putting the different processes into different containers, with everything else in an additional one, and see if that helps.

As for other random things to try: you can raise the priority of your various processes, or add real-time semantics (caution: a buggy program with real-time privileges can starve the rest of the system).
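
A sketch of what those two suggestions can look like in code (both typically require root or CAP_SYS_NICE, and the priority values are illustrative):

    #include <sched.h>
    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/time.h>

    /* Real-time FIFO scheduling for the current process. As warned above,
     * a runaway SCHED_FIFO task can starve everything else on its CPU. */
    int go_realtime(int prio) {               /* e.g. prio = 10 */
        struct sched_param sp = { .sched_priority = prio };
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");
            return -1;
        }
        return 0;
    }

    /* Milder alternative: raise ordinary priority via the nice value. */
    int raise_priority(void) {
        if (setpriority(PRIO_PROCESS, 0, -10) != 0) {
            perror("setpriority");
            return -1;
        }
        return 0;
    }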
