clock_gettime(CLOCK_MONOTONIC_RAW, ...) seems to cause user space to freeze for about 5-6 minutes

Posted on 2025-01-12 22:51:10

I have an embedded system running Linux 2.6.30 on a PowerPC CPU (Freescale MPC5125). After writing new code for this device, I suddenly observed user-space hangs lasting about 5-6 minutes.

It turned out that the new code calls clock_gettime(CLOCK_MONOTONIC_RAW, ...) and that the system freezes after one of these clock_gettime() calls returns an implausible value.

Here is the recorded output of strace:

14:39:48.769746 clock_gettime(0x4 /* CLOCK_??? */, {14496, 285316209}) = 0
14:39:48.782047 clock_gettime(0x4 /* CLOCK_??? */, {14496, 285627946}) = 0
14:39:48.782354 select(14, [4 5 6 7 9 10 12 13], NULL, NULL, {19, 999689}) = 1 (in [13], left {19, 853317})
14:39:48.928554 read(13, "\0\0\0\257\0\0\0\1", 8) = 8
14:39:48.928917 clock_gettime(0x4 /* CLOCK_??? */, {1266889381, 847702609}) = 0
14:45:15.612681 time(NULL)              = 1646750715
14:45:15.613026 ...
14:45:15.818364 clock_gettime(0x4 /* CLOCK_??? */, {14819, 27615047}) = 0

The system continues to respond to ICMP echo requests, accepts new TCP connections, and even ACKs incoming data on these new TCP connections. But the entire user space seems to hang; there is not even an echo on the serial console.

After 5-6 minutes, the system recovers and continues to work: TCP sessions have not timed out, all ssh sessions keep working, and input on the serial console is echoed again.

The problem seems to go away when I use CLOCK_MONOTONIC instead of CLOCK_MONOTONIC_RAW. I could simply make this change to my software and treat the problem as a known kernel bug on my system that is easy to avoid, but other programs on that system also use CLOCK_MONOTONIC_RAW and I cannot change them. I should at least understand what is going wrong inside the kernel.

Unfortunately, the system does not have JTAG pads on its PCB, so I cannot debug the kernel while user space is blocked.

So, here are my questions:

  • Has anybody ever observed problems like this?
  • What might be the cause of clock_gettime() calls blocking all user space processes?
  • What can I do to continue hunting this problem in the kernel?
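
One way to narrow this down from user space, sketched here under assumptions, is to poll CLOCK_MONOTONIC_RAW in a loop and log any step that goes backwards or exceeds a threshold. The function names (ts_to_ns, is_implausible_step, watch_raw_clock) and the threshold are illustrative and are not taken from the code that triggered the hang. The fallback definition of CLOCK_MONOTONIC_RAW as 4 is an assumption for older headers; it matches the 0x4 clock id that strace printed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#ifndef CLOCK_MONOTONIC_RAW
#define CLOCK_MONOTONIC_RAW 4 /* assumed Linux clock id, same as the 0x4 seen in strace */
#endif

/* Convert a timespec to a single nanosecond count. */
int64_t ts_to_ns(struct timespec ts)
{
    return (int64_t)ts.tv_sec * 1000000000LL + (int64_t)ts.tv_nsec;
}

/* Return 1 if the step from prev to cur goes backwards or is larger
 * than max_step_ns, otherwise 0. */
int is_implausible_step(struct timespec prev, struct timespec cur,
                        int64_t max_step_ns)
{
    assert(max_step_ns >= 0);
    int64_t delta = ts_to_ns(cur) - ts_to_ns(prev);
    return (delta < 0 || delta > max_step_ns) ? 1 : 0;
}

/* Read CLOCK_MONOTONIC_RAW `iterations` times and report every
 * implausible step on stderr. Returns the number of implausible
 * steps seen, or -1 if clock_gettime() itself fails. */
int watch_raw_clock(int iterations, int64_t max_step_ns)
{
    struct timespec prev, cur;
    int bad = 0;

    if (clock_gettime(CLOCK_MONOTONIC_RAW, &prev) != 0) {
        perror("clock_gettime");
        return -1;
    }
    for (int i = 0; i < iterations; i++) {
        if (clock_gettime(CLOCK_MONOTONIC_RAW, &cur) != 0) {
            perror("clock_gettime");
            return -1;
        }
        if (is_implausible_step(prev, cur, max_step_ns)) {
            fprintf(stderr, "implausible step: %lld.%09ld -> %lld.%09ld\n",
                    (long long)prev.tv_sec, (long)prev.tv_nsec,
                    (long long)cur.tv_sec, (long)cur.tv_nsec);
            bad++;
        }
        prev = cur;
    }
    return bad;
}
```

Calling watch_raw_clock(1000000, 1000000000LL) from a small test program's main() would flag any single step of more than one second, such as the jump from 14496 s to 1266889381 s in the trace above. This only detects the bad reading; it does not explain why the kernel produced it.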
