Why is a thread's state "running" when it is not using any CPU?
Today I found a very strange problem.
I am running Redhat Enterprise Linux 6 on an Intel E31275 CPU (4 cores, 8 threads). I found that one kernel thread (I'll call it my_thread) was not working correctly.
With the "ps" command, I found that the status of my_thread was always running:
ps ax
5545 ? R 3:14 [my_thread]
15774 ttyS0 Ss 0:00 -bash
...
But its running time was always 3:14. Since it was running, why didn't the total time increase?
From the proc file /proc/5545/sched, I found that all the statistics for this thread, including the wakeup count (se.nr_wakeups), always stayed the same, too.
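The same observation can be cross-checked from user space by sampling the thread's state and accumulated CPU time from /proc/<pid>/stat. The following is only a minimal sketch: the helper names `parse_stat_line` and `read_task_stat` and the `struct task_stat` type are made up for illustration, and the field positions (state is field 3, utime and stime are fields 14 and 15) are taken from the proc(5) manual page.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Subset of the fields found in /proc/<pid>/stat (hypothetical helper type). */
struct task_stat {
    char state;           /* field 3: 'R', 'S', 'D', ... */
    unsigned long utime;  /* field 14: user-mode CPU time, in clock ticks */
    unsigned long stime;  /* field 15: kernel-mode CPU time, in clock ticks */
};

/* Parse one line of /proc/<pid>/stat. Returns 0 on success, -1 on failure. */
int parse_stat_line(const char *line, struct task_stat *out)
{
    assert(line != NULL && out != NULL);

    /* comm (field 2) may contain spaces or ')', so resume after the last ')'. */
    const char *p = strrchr(line, ')');
    if (p == NULL)
        return -1;

    /* After ')': state, ten skipped fields (4..13), then utime and stime. */
    int n = sscanf(p + 1,
                   " %c %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %lu %lu",
                   &out->state, &out->utime, &out->stime);
    return n == 3 ? 0 : -1;
}

/* Read and parse /proc/<pid>/stat. Returns 0 on success, -1 on failure. */
int read_task_stat(int pid, struct task_stat *out)
{
    char path[64];
    char line[1024];

    snprintf(path, sizeof path, "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    return ok != NULL ? parse_stat_line(line, out) : -1;
}
```

Calling `read_task_stat(5545, &st)` twice a few seconds apart and comparing `st.utime + st.stime` would show whether the thread accrued any CPU time in between; dividing the tick counts by `sysconf(_SC_CLK_TCK)` converts them to seconds.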
From /proc/5545/stack, I found that this thread had called the following function and never returned:
interruptible_sleep_on_timeout(&q, 3*HZ);
In theory, this function should return every 3 seconds if no other thread wakes the thread up. Each time the function returns, se.nr_wakeups in /proc/5545/sched should increase by 1. But this stopped happening once the thread got into the problem state.
Does anyone have any ideas? Is it possible that interruptible_sleep_on_timeout() never returns?
Update:
I found that the problem does not occur if I set a CPU affinity for this thread. If I pin it to a dedicated core, everything is OK. Is there a problem with SMP scheduling?
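For reference, pinning a task to one CPU from user space can be sketched with `sched_setaffinity(2)`, which is the system call the `taskset` utility uses. The helper names `pin_to_cpu` and `first_allowed_cpu` are made up for illustration. This is an assumption about how the pinning might be done, not necessarily how it was done here, and the kernel may refuse the call for kernel threads that it has bound to a specific CPU.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/types.h>

/* Pin task `pid` (0 means the calling thread) to a single CPU.
 * Returns 0 on success, -1 on failure with errno set. */
int pin_to_cpu(pid_t pid, int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);

    cpu_set_t mask;
    CPU_ZERO(&mask);       /* start from an empty CPU set */
    CPU_SET(cpu, &mask);   /* allow exactly one CPU */
    return sched_setaffinity(pid, sizeof mask, &mask);
}

/* Return the lowest-numbered CPU the task may currently run on, or -1. */
int first_allowed_cpu(pid_t pid)
{
    cpu_set_t mask;

    if (sched_getaffinity(pid, sizeof mask, &mask) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &mask))
            return cpu;
    }
    return -1;
}
```

For example, `pin_to_cpu(5545, 3)` would restrict the thread from the question to CPU 3, provided the caller has sufficient privileges and the kernel permits changing that thread's affinity.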
Update again:
Since I disabled Hyper-Threading in the BIOS, I have not seen the problem again.
Comments (1)
First off, R means the thread is runnable, not necessarily running. That is, it does not mean the thread is executing on a CPU; it means the thread is in a state in which the scheduler is allowed to pick it to run. There is a big difference between the two.
In a similar sense, interruptible_sleep_on_timeout(&q, 3*HZ); does not run the thread after 3*HZ jiffies (3 seconds); it only makes the thread available for running once that timeout expires. And indeed "ps" shows it as available for running, so the timeout may well have occurred.
Since you did not say anything about the kernel thread in question, I don't even know whether it is in your own code or in standard kernel code, so I cannot really answer in detail.
One possible reason for the situation you described is that some other thread (user or kernel) has a higher priority than your thread, so the scheduler never picks yours to run. If so, that other thread is most probably running at a real-time priority (SCHED_FIFO or SCHED_RR).
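To look for such a thread, one could query the scheduling policy of suspect tasks with `sched_getscheduler(2)`. The following is only a sketch under that assumption; the helper names `policy_name` and `is_realtime` are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/types.h>

/* Human-readable name for the policies discussed above. */
const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "other";
    }
}

/* Return 1 if task `pid` (0 means the caller) uses a real-time policy,
 * 0 if it does not, and -1 if the policy could not be queried. */
int is_realtime(pid_t pid)
{
    int policy = sched_getscheduler(pid);
    if (policy < 0)
        return -1;

#ifdef SCHED_RESET_ON_FORK
    /* The kernel may OR this flag into the returned value; strip it. */
    policy &= ~SCHED_RESET_ON_FORK;
#endif
    return (policy == SCHED_FIFO || policy == SCHED_RR) ? 1 : 0;
}
```

Running `is_realtime()` over the PIDs listed in /proc would point out candidate tasks that could keep a normal-priority thread off the CPU it last ran on.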