Why does block I/O completion take so long when it crosses CPUs?

Posted on 2024-08-14 11:47:01

I am trying to squeeze the most performance out of a Linux block driver for a high-end storage device. One problem that has me a bit stumped at the moment is this: if a user task starts an I/O operation (read or write) on one CPU, and the device interrupt occurs on another CPU, I incur about 80 microseconds of delay before the task resumes execution.

I can see this using O_DIRECT against the raw block device, so this is not page cache or filesystem related. The driver uses make_request to receive operations, so it has no request queue and does not use any kernel I/O scheduler (you'll have to trust me, it's way faster this way).
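
For concreteness, a minimal O_DIRECT latency probe along these lines might look like the sketch below. This is illustrative only, not the actual test harness; /dev/sdX, the 4 KiB transfer size, and the iteration count are placeholders.

    /* Minimal O_DIRECT read-latency probe (sketch).
     * Build: gcc -O2 -o odirect_lat odirect_lat.c   (add -lrt on older glibc)
     * Run as root against an otherwise idle block device, e.g. ./odirect_lat /dev/sdX
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096          /* must be a multiple of the device's logical block size */
    #define ITERATIONS 10000

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <block device>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE)) {  /* O_DIRECT needs aligned buffers */
            perror("posix_memalign"); return 1;
        }

        struct timespec t0, t1;
        double total_us = 0.0;

        for (int i = 0; i < ITERATIONS; i++) {
            clock_gettime(CLOCK_MONOTONIC, &t0);
            if (pread(fd, buf, BLOCK_SIZE, (off_t)i * BLOCK_SIZE) != BLOCK_SIZE) {
                perror("pread"); return 1;
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);
            total_us += (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
        }

        printf("average latency: %.1f us over %d reads\n", total_us / ITERATIONS, ITERATIONS);
        close(fd);
        free(buf);
        return 0;
    }

Running this pinned to the CPU that services the device interrupt, and then pinned to a different CPU, should show the kind of gap described above.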

I can demonstrate to myself that the problem occurs between calling bio_endio on one CPU and the task being rescheduled on another CPU. If the task is on the same CPU, it starts very quickly, and if the task is on another physical CPU, it takes a lot longer -- usually about 80 microseconds longer on my current test system (x86_64 on Intel 5520 [NUMA] chipset).

I can instantly double my performance by setting the process and IRQ CPU affinity to the same physical CPU, but that's not a good long-term solution -- I'd rather be able to get good performance no matter where the I/Os originate. And I only have one IRQ, so I can only steer it to one CPU at a time -- no good if many threads are running on many CPUs.
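
Setting both affinities for such an experiment can be done roughly as in the sketch below; the IRQ number and CPU index are placeholders, it assumes fewer than 32 CPUs for the affinity mask, and it needs root.

    /* Sketch: pin the calling process and a given IRQ to the same CPU.
     * Build: gcc -O2 -o pin_same_cpu pin_same_cpu.c
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int pin_process_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(0, sizeof(set), &set);   /* 0 = this process */
    }

    static int pin_irq_to_cpu(int irq, int cpu)
    {
        char path[64];
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);

        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        /* smp_affinity takes a hex CPU bitmask; assumes cpu < 32 for brevity */
        fprintf(f, "%x\n", 1u << cpu);
        return fclose(f);
    }

    int main(int argc, char **argv)
    {
        int irq = (argc > 1) ? atoi(argv[1]) : 42;   /* placeholder IRQ number */
        int cpu = (argc > 2) ? atoi(argv[2]) : 0;

        if (pin_process_to_cpu(cpu) || pin_irq_to_cpu(irq, cpu)) {
            perror("pinning failed");
            return 1;
        }
        printf("process and IRQ %d pinned to CPU %d\n", irq, cpu);
        return 0;
    }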

I can see this problem on kernels from CentOS 5.4's 2.6.18 to the mainline 2.6.32.

So the question is: why does it take longer for the user process to resume, if I called bio_endio from another CPU? Is this a scheduler issue? And is there any way to eliminate or lower the delay?

4 Answers

穿透光 2024-08-21 11:47:01

If you finish your I/O on a particular CPU, then that processor is immediately free to start working on a new thread - and if you finish your I/O on the same processor as the thread that requested it, then the next thread to run is likely to be the one you just finished the I/O for.

On the other hand, if you finish on a different processor, the thread that requested the I/O won't get to run immediately - it has to wait until whatever is currently executing finishes its quantum or otherwise relinquishes the CPU.

As far as I understand.
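
One way to see the cross-CPU wakeup cost in isolation, away from the block layer entirely, is to time how long it takes a thread pinned to one CPU to wake a thread pinned to another. A rough sketch (the CPU indices are placeholders):

    /* Sketch: measure wakeup latency between two pinned threads via a pipe.
     * Compare the numbers with both threads on the same CPU vs. different CPUs.
     * Build: gcc -O2 -pthread -o wake_lat wake_lat.c
     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define ITERATIONS 10000

    static int pipefd[2];

    static void pin_self(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *waiter(void *arg)
    {
        pin_self(*(int *)arg);
        struct timespec sent, woke;
        double total_us = 0.0;
        int n = 0;

        while (read(pipefd[0], &sent, sizeof(sent)) == sizeof(sent)) {  /* block until woken */
            clock_gettime(CLOCK_MONOTONIC, &woke);
            total_us += (woke.tv_sec - sent.tv_sec) * 1e6 +
                        (woke.tv_nsec - sent.tv_nsec) / 1e3;
            n++;
        }
        if (n)
            printf("average wakeup latency over %d wakeups: %.1f us\n", n, total_us / n);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int waker_cpu  = (argc > 1) ? atoi(argv[1]) : 0;   /* placeholder CPU indices */
        int waiter_cpu = (argc > 2) ? atoi(argv[2]) : 1;

        if (pipe(pipefd)) { perror("pipe"); return 1; }

        pthread_t t;
        pthread_create(&t, NULL, waiter, &waiter_cpu);
        pin_self(waker_cpu);
        sleep(1);                                   /* let the waiter block on the pipe */

        for (int i = 0; i < ITERATIONS; i++) {
            struct timespec now;
            clock_gettime(CLOCK_MONOTONIC, &now);
            if (write(pipefd[1], &now, sizeof(now)) != sizeof(now))
                break;                              /* wake the waiter with a send timestamp */
            usleep(100);                            /* let it block again before the next round */
        }
        close(pipefd[1]);                           /* waiter's read() returns 0 and it exits */
        pthread_join(t, NULL);
        return 0;
    }

Comparing the same-CPU and cross-CPU numbers gives a baseline for how much of the delay is plain scheduler/IPI wakeup cost rather than anything specific to the driver.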

陪我终i 2024-08-21 11:47:01

It could just be the latency inherent in issuing an IPI from the CPU that completed the bio to the CPU where the task gets scheduled - to test this, try booting with idle=poll.

怎樣才叫好 2024-08-21 11:47:01

This patch was just posted to LKML, implementing QUEUE_FLAG_SAME_CPU in the block device layer, which is described as:

    Add a flag to make request complete on
    cpu where request is submitted. The
    flag implies QUEUE_FLAG_SAME_COMP. By
    default, it is off.

It sounds like it might be just what you need...
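
As an aside, later mainline kernels expose the related behaviour through the rq_affinity queue attribute in sysfs (which controls QUEUE_FLAG_SAME_COMP). Assuming your kernel has that file, a minimal sketch of turning it on from C (the device name is a placeholder; needs root):

    /* Sketch: enable completion-on-submitting-CPU via the rq_affinity sysfs knob,
     * assuming the running kernel exposes /sys/block/<dev>/queue/rq_affinity.
     */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const char *dev = (argc > 1) ? argv[1] : "sdX";   /* placeholder device name */
        char path[128];
        snprintf(path, sizeof(path), "/sys/block/%s/queue/rq_affinity", dev);

        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return 1; }

        /* 1 = complete near the submitting CPU; some kernels also accept 2
         * to force completion on the exact submitting CPU. */
        fprintf(f, "1\n");
        return fclose(f) ? 1 : 0;
    }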

二智少女猫性小仙女 2024-08-21 11:47:01

Looks like I misunderstood the problem a bit: it seems to be related to cache misses. When the CPU handling the interrupt wasn't the CPU that started the I/O, that CPU can hit 100% utilization, and then everything slows down, giving the impression that there is a long delay communicating between CPUs.

Thanks to everyone for their ideas.
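
For anyone who wants to check the cache-miss theory on their own setup: on kernels new enough to have perf_event_open (2.6.31 and later), hardware cache misses can be counted around the region of interest. A rough sketch follows; what PERF_COUNT_HW_CACHE_MISSES actually maps to is CPU-specific, so treat the absolute numbers as a rough signal only.

    /* Sketch: count hardware cache misses around a region of interest using
     * perf_event_open. Needs root or a permissive perf_event_paranoid setting.
     * Build: gcc -O2 -o cachemiss cachemiss.c
     */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_MISSES;
        attr.disabled = 1;
        attr.exclude_hv = 1;              /* count user + kernel, exclude hypervisor */

        int fd = perf_event_open(&attr, 0, -1, -1, 0);   /* this process, any CPU */
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... run the O_DIRECT read loop (or whatever is being investigated) here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t misses = 0;
        if (read(fd, &misses, sizeof(misses)) != sizeof(misses)) { perror("read"); return 1; }
        printf("cache misses: %llu\n", (unsigned long long)misses);
        close(fd);
        return 0;
    }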
