Apache mod_perl processes hang in futex_wait state
I run a fairly popular browser-based web game, running under Apache (worker) and mod_perl. During peak times, when the server is handling about 4200 requests per minute, once every 3-15 minutes or so an Apache process will hang.
I have established that these processes get stuck in a "FUTEX_WAIT" state, and don't appear to be doing anything: they don't consume CPU or grow larger in RAM. But it's a serious problem because they just sit there, occupying RAM.
My current solution is a cron job that culls Apache processes stuck in futex_wait_queue_me. But that's not great, because users who happen to be waiting on a response from the hung Apache processes receive errors (500: server closed connection without sending data back).
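For illustration, here is a minimal sketch of the kind of check such a cron job might perform. It is not the actual script in use. It assumes a Linux `/proc` filesystem that exposes `comm` and `wchan` for each process, and that the Apache processes are named `apache2` (on some systems the name is `httpd`). The function name `find_stuck_pids` and its parameters are hypothetical, and the kernel symbol reported in `wchan` can differ between kernel versions.

```python
import os


def find_stuck_pids(proc_root="/proc", comm_prefix="apache2",
                    wchan_name="futex_wait_queue_me"):
    """Return sorted PIDs whose command name starts with comm_prefix and
    whose main thread is sleeping in the kernel function wchan_name.

    comm_prefix and wchan_name are assumptions; adjust them for the
    local Apache binary name and kernel version.
    """
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "comm")) as f:
                comm = f.read().strip()
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            continue  # process exited meanwhile, or is not readable
        if comm.startswith(comm_prefix) and wchan == wchan_name:
            stuck.append(int(entry))
    return sorted(stuck)
```

Calling `find_stuck_pids()` only reports candidates. Whether and how to signal them is left out on purpose, since killing them is exactly what produces the 500 errors described here.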
I have been unable to reproduce the problem on my development machine, and can't figure out how to proceed with troubleshooting. I would love to know: How can I diagnose this further?
Edit: I have found that the problem occurs following a burst in traffic, when Apache spawns some more worker processes, then tries to cull them afterward. This is how that looks when it works normally, from the child's point of view:
$ sudo strace -p 21764
Process 21764 attached - interrupt to quit
read(5, "!", 1) = 1
tgkill(21764, 21791, SIGHUP) = 0
tgkill(21764, 21791, SIG_0) = 0
select(0, NULL, NULL, NULL, {0, 500000}) = ? ERESTARTNOHAND (To be restarted)
--- SIGTERM (Terminated) @ 0 (0) ---
rt_sigreturn(0xf) = -1 EINTR (Interrupted system call)
munmap(0x7f9905750000, 8392704) = 0
munmap(0x7f98f8736000, 8392704) = 0
[...]
madvise(0x7f98e4021000, 73728, MADV_DONTNEED) = 0
exit_group(0) = ?
Process 21764 detached
... but occasionally it goes like this:
$ sudo strace -p 24133
Process 24133 attached - interrupt to quit
read(5, "!", 1) = 1
tgkill(24133, 24164, SIGHUP) = 0
tgkill(24133, 24164, SIG_0) = 0
--- SIGTERM (Terminated) @ 0 (0) ---
rt_sigreturn(0xf) = 0
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
tgkill(24133, 24140, SIGUSR1) = 0
futex(0x7f9904f4e9d0, FUTEX_WAIT, 24140, NULL
... and proceeds no further.
I don't know how to debug this any further.
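One possible next step, offered as a sketch rather than a known fix: in the hung trace, the value passed to `futex` (24140) is the same thread ID that was just sent SIGUSR1, which may mean the main thread is waiting for that thread to exit. Listing what every thread of the hung process is doing could confirm or rule that out. The helper below assumes a Linux `/proc` filesystem with per-thread `status` and `wchan` files; the function name `thread_states` is hypothetical.

```python
import os


def thread_states(pid, proc_root="/proc"):
    """Return a list of (tid, state, wchan) tuples, one per thread of pid."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    rows = []
    for tid in sorted(os.listdir(task_dir), key=int):
        base = os.path.join(task_dir, tid)
        try:
            with open(os.path.join(base, "status")) as f:
                # The "State:" line looks like "State:\tS (sleeping)".
                state = next(
                    (line.split(":", 1)[1].strip()
                     for line in f if line.startswith("State:")),
                    "?",
                )
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip() or "-"
        except OSError:
            continue  # thread exited while we were reading
        rows.append((int(tid), state, wchan))
    return rows
```

Run as root against a hung PID, for example `thread_states(24133)`, and check whether the thread that received SIGUSR1 is still alive and where it is sleeping.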
Comments (2)
This was due to a bug in mod_perl, which has since been fixed. It is documented here:
http://www.gossamer-threads.com/lists/modperl/dev/104026
Pick the lowest-traffic time and start Apache under strace on the live machine, so you can track down the cause of the error. For one internet blogger, a solution boiled down to
You can avoid
500: server closed connection without sending data back
by using a reverse-proxy setup: when Apache detects a timeout without data, it forwards the client's request to a different mod_perl child. That way, instead of the client getting a 500, the request takes an extra 5 seconds.
(Don't ask me for a how-to; see the mod_perl/Apache guide :)