Mongrel hangs with 100% CPU / EBADF (Bad file descriptor)
We have a server with 10 mongrel_cluster instances running behind Apache, and every now and then one or more of them hang.
No activity is seen in the database (we're using ActiveRecord sessions). We use MySQL with InnoDB tables. SHOW INNODB STATUS shows no locks, and SHOW PROCESSLIST shows nothing.
The server is Debian Linux 4.0.
Ruby is: ruby 1.8.6 (2008-03-03 patchlevel 114) [i486-linux]
Rails is: Rails 1.1.2 (yes, quite old)
We're using the native MySQL connector (gem install mysql).
Running "strace -p PID" on a hung mongrel process shows the following, repeated in a loop:
gettimeofday({1219834026, 235289}, NULL) = 0
select(4, [3], [0], [], {0, 905241}) = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 235477}, NULL) = 0
select(4, [3], [0], [], {0, 905053}) = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 235654}, NULL) = 0
select(4, [3], [0], [], {0, 904875}) = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 235829}, NULL) = 0
select(4, [3], [0], [], {0, 904700}) = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 236017}, NULL) = 0
select(4, [3], [0], [], {0, 904513}) = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 236192}, NULL) = 0
select(4, [3], [0], [], {0, 904338}) = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 236367}, NULL) = 0
...
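The timestamps in this excerpt hint at why the process pins the CPU: each select() is given a timeout of about 0.9 s, yet the next gettimeofday() follows only about 180 µs later, because select() fails at once with EBADF and the caller retries immediately. The sketch below recomputes that interval from the lines above. It is meant for offline analysis with a modern Ruby (not the 1.8.6 on the server), and the helper names are hypothetical.

```ruby
# Estimate how fast the gettimeofday()/select() loop from the strace excerpt spins.
# Assumption: each gettimeofday line has the form
#   gettimeofday({seconds, microseconds}, NULL) = 0
TRACE = <<~LOG
  gettimeofday({1219834026, 235289}, NULL) = 0
  select(4, [3], [0], [], {0, 905241}) = -1 EBADF (Bad file descriptor)
  gettimeofday({1219834026, 235477}, NULL) = 0
  select(4, [3], [0], [], {0, 905053}) = -1 EBADF (Bad file descriptor)
  gettimeofday({1219834026, 235654}, NULL) = 0
  select(4, [3], [0], [], {0, 904875}) = -1 EBADF (Bad file descriptor)
  gettimeofday({1219834026, 235829}, NULL) = 0
  select(4, [3], [0], [], {0, 904700}) = -1 EBADF (Bad file descriptor)
  gettimeofday({1219834026, 236017}, NULL) = 0
  select(4, [3], [0], [], {0, 904513}) = -1 EBADF (Bad file descriptor)
  gettimeofday({1219834026, 236192}, NULL) = 0
  select(4, [3], [0], [], {0, 904338}) = -1 EBADF (Bad file descriptor)
  gettimeofday({1219834026, 236367}, NULL) = 0
LOG

# Returns the gettimeofday timestamps as integer microseconds; other lines are skipped.
def gettimeofday_stamps(lines)
  lines.filter_map do |line|
    m = line.match(/gettimeofday\(\{(\d+), (\d+)\}/)
    m && (m[1].to_i * 1_000_000 + m[2].to_i)
  end
end

# Hypothetical helper: mean gap between consecutive gettimeofday calls, in microseconds.
def mean_loop_interval_us(lines)
  gaps = gettimeofday_stamps(lines).each_cons(2).map { |a, b| b - a }
  gaps.sum.to_f / gaps.size
end

interval = mean_loop_interval_us(TRACE.lines)
puts format("~%.0f us per iteration, ~%.0f iterations per second", interval, 1_000_000 / interval)
# prints "~180 us per iteration, ~5566 iterations per second"
```

Roughly 5,500 failed select() calls per second would be consistent with the 100% CPU reported in the title.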
I used lsof and found that the process used 67 file descriptors (lsof -p PID | wc -l).
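Note that lsof -p PID | wc -l counts every output line, including the header and rows such as cwd, rtd, txt, mem and DEL, which describe directories and memory-mapped files rather than open descriptors, so 67 is an upper bound. A sketch that counts only rows whose FD column starts with a number, assuming lsof's default column layout; count_open_fds is a hypothetical helper and the code targets a modern Ruby.

```ruby
# Count entries in `lsof -p PID` output that are actual file descriptors.
# Assumes lsof's default columns: COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME.
# Descriptor rows have a numeric FD followed by a mode letter ("3u", "4w");
# rows such as cwd, rtd, txt, mem and DEL are directories or mapped files.
def count_open_fds(lsof_output)
  lsof_output.each_line.count do |line|
    fields = line.split
    fields.size >= 4 && fields[3].match?(/\A\d+/)
  end
end

# A few rows taken from the listing in this question.
SAMPLE_LSOF = <<~LSOF
  COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
  mongrel_r 11628 username cwd DIR 9,2 4096 1870688 /home/domains/example.com/usernameOrder/releases/20080831121802
  mongrel_r 11628 username mem REG 9,1 1933648 456972 /usr/lib/libmysqlclient.so.15.0.0
  mongrel_r 11628 username DEL REG 0,8 15560245 /dev/zero
  mongrel_r 11628 username 1w REG 9,2 462923 1575329 /home/domains/example.com/usernameOrder/shared/log/mongrel.8001.log
  mongrel_r 11628 username 3u IPv4 15442350 TCP localhost:8001 (LISTEN)
  mongrel_r 11628 username 12u REG 9,1 26228 234227 /tmp/CGI.11628.3
LSOF

puts count_open_fds(SAMPLE_LSOF) # prints 3
# On the server itself, something like: count_open_fds(`lsof -p 11628`)
```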
Is there any other way I can debug this, so that I could, for example, determine which file descriptor is "bad"?
Any other info or suggestions? Has anybody else seen this?
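One way to narrow down the "bad" descriptor on Linux is to compare the descriptor sets in the looping select() call with the descriptors the process actually has open, which appear as entries under /proc/PID/fd. In the strace excerpt the read set is [3] and the write set is [0], so those are the two candidates. A minimal sketch, assuming Linux procfs and strace's select(nfds, [read], [write], [except], timeout) formatting; the helper names are hypothetical and the code targets a modern Ruby.

```ruby
# Report descriptors that appear in a select() call from strace but are not
# open in the traced process. Linux-only: relies on /proc/<pid>/fd, which needs
# to be readable by the current user (same user as the mongrel, or root).

# Extract every fd number from the bracketed sets, e.g.
# "select(4, [3], [0], [], {0, 905241})" -> [3, 0]
def select_fds(strace_line)
  strace_line.scan(/\[([\d ]*)\]/).flatten.flat_map { |set| set.split.map(&:to_i) }.uniq
end

# Descriptor numbers currently open in the given process.
def open_fds(pid)
  Dir.children("/proc/#{pid}/fd").map(&:to_i)
end

# Hypothetical helper: fds passed to select() that the process does not have open.
def bad_select_fds(strace_line, pid)
  select_fds(strace_line) - open_fds(pid)
end

# Example against the hung mongrel from the question:
#   bad_select_fds("select(4, [3], [0], [], {0, 905241}) = -1 EBADF", 11628)
```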
The site is fairly busy, but not overly so; load averages are usually around 0.3.
Some additional info: I installed mongrelproctitle to show what the hung processes were doing, and it seems they hang in a method that displays images using file_column (images from the database) and RMagick to resize them and make them greyscale.
It's not conclusive that the problem is here, but it is a suspicion.
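As far as I understand it, the general idea behind plugins such as mongrelproctitle is to rewrite the process title ($0 in Ruby) while a request is being served, so that ps or top shows what each mongrel is busy with. A rough sketch of that technique, not the plugin's actual code; with_proctitle and the title format are hypothetical.

```ruby
# Temporarily set the process title while a block runs, restoring it afterwards.
# On Linux, assigning to $0 changes what `ps` shows for this process.
def with_proctitle(title)
  previous = $0
  $0 = title
  yield
ensure
  $0 = previous # restore even if the request raised
end

# Hypothetical usage around a request handler:
with_proctitle("mongrel [8001]: handling /order/preview_image") do
  # ... serve the request here ...
end
```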
Is there something obviously wrong with the following? The method displays a static image if the order doesn't contain an image; otherwise it displays the order's image, resized. The cache headers are there so that the image gets updated in the browser every time. The image is inserted in the page with a normal image tag.
Code:
def preview_image
  @order = session[:order]
  if @order.image.nil?
    # No uploaded image: send the static placeholder picture.
    @headers['Pragma'] = 'no-cache'
    @headers['Cache-Control'] = 'no-cache, must-revalidate'
    send_data(EMPTY_PIC.to_blob, :filename => "img.jpg", :type => "image/jpeg",
              :disposition => "inline")
  else
    # Image.read returns an array of frames; use the first one.
    @pic = Image.read(@order.image)[0]
    # Apply the user's crop rectangle, if any (the final true resets the page offset).
    if @order.crop
      @pic.crop!(@order.crop[:x1].to_i, @order.crop[:y1].to_i,
                 @order.crop[:width].to_i, @order.crop[:height].to_i, true)
    end
    # Scale to exactly 103x130, then reduce to 256 grey levels.
    @pic.resize!(103, 130)
    @pic = @pic.quantize(256, Magick::GRAYColorspace)
    # Disable caching so the browser always fetches the latest preview.
    @headers['Pragma'] = 'no-cache'
    @headers['Cache-Control'] = 'no-cache, must-revalidate'
    send_data(@pic.to_blob, :filename => "img.jpg", :type => "image/jpeg",
              :disposition => "inline")
  end
end
Here is the lsof output, in case anybody can spot a problem in it (I don't know how it will be formatted in this message):
lsof: WARNING: can't stat() ext3 file system /dev/.static/dev
Output information may be incomplete.
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
mongrel_r 11628 username cwd DIR 9,2 4096 1870688 /home/domains/example.com/usernameOrder/releases/20080831121802
mongrel_r 11628 username rtd DIR 9,1 4096 2 /
mongrel_r 11628 username txt REG 9,1 3564 167172 /usr/bin/ruby1.8
mongrel_r 11628 username mem REG 0,0 0 [heap] (stat: No such file or directory)
mongrel_r 11628 username DEL REG 0,8 15560245 /dev/zero
mongrel_r 11628 username DEL REG 0,8 15560242 /dev/zero
mongrel_r 11628 username DEL REG 0,8 15560602 /dev/zero
mongrel_r 11628 username DEL REG 0,8 15560601 /dev/zero
mongrel_r 11628 username DEL REG 0,8 15560684 /dev/zero
mongrel_r 11628 username DEL REG 0,8 15560683 /dev/zero
mongrel_r 11628 username DEL REG 0,8 15560685 /dev/zero
mongrel_r 11628 username DEL REG 0,8 15560568 /dev/zero
mongrel_r 11628 username DEL REG 0,8 15560607 /dev/zero
mongrel_r 11628 username DEL REG 0,8 15560569 /dev/zero
mongrel_r 11628 username mem REG 9,1 1933648 456972 /usr/lib/libmysqlclient.so.15.0.0
mongrel_r 11628 username DEL REG 0,8 15442414 /dev/zero
mongrel_r 11628 username DEL REG 0,8 15560546 /dev/zero
mongrel_r 11628 username mem REG 9,1 67408 457393 /lib/i686/cmov/libresolv-2.7.so
mongrel_r 11628 username mem REG 9,1 17884 457386 /lib/i686/cmov/libnss_dns-2.7.so
mongrel_r 11628 username DEL REG 0,8 15560541 /dev/zero
mongrel_r 11628 username DEL REG 0,8 15560246 /dev/zero
mongrel_r 11628 username DEL REG 0,8 15560693 /dev/zero
mongrel_r 11628 username DEL REG 0,8 15560608 /dev/zero
mongrel_r 11628 username mem REG 9,1 25700 164963 /usr/lib/gconv/gconv-modules.cache
mongrel_r 11628 username mem REG 9,1 83708 457384 /lib/i686/cmov/libnsl-2.7.so
mongrel_r 11628 username mem REG 9,1 140602 506903 /var/lib/gems/1.8/gems/mysql-2.7/lib/mysql.so
mongrel_r 11628 username mem REG 9,1 1282816 180935
...
mongrel_r 11628 username 1w REG 9,2 462923 1575329 /home/domains/example.com/usernameOrder/shared/log/mongrel.8001.log
mongrel_r 11628 username 2w REG 9,2 462923 1575329 /home/domains/example.com/usernameOrder/shared/log/mongrel.8001.log
mongrel_r 11628 username 3u IPv4 15442350 TCP localhost:8001 (LISTEN)
mongrel_r 11628 username 4w REG 9,2 118943548 1575355 /home/domains/example.com/usernameOrder/shared/log/production.log
mongrel_r 11628 username 5u REG 9,1 145306 234226 /tmp/mongrel.11628.0 (deleted)
mongrel_r 11628 username 7u unix 0xc3c12480 15442417 socket
mongrel_r 11628 username 11u REG 9,1 50 234180 /tmp/CGI.11628.2
mongrel_r 11628 username 12u REG 9,1 26228 234227 /tmp/CGI.11628.3
I have installed monit to monitor the server. There are no automatic restarts yet because of the PID file issue, but maybe I will get the newest version, which supports deleting stale PID files.
It would be nice to actually fix the problem, though, because users will get disconnected etc. if the server needs to be restarted all the time (~10 times a day).
The mongrel processes don't use much memory when this happens, and the machine isn't even swapping, so it's probably not a memory leak.
total used free shared buffers cached
Mem: 4152796 4083000 69796 0 616624 2613364
-/+ buffers/cache: 853012 3299784
Swap: 1999992 52 1999940
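The free output is consistent with this. The Mem row counts buffers and page cache as "used", so the figure that matters is the -/+ buffers/cache row: used minus buffers minus cached, and free plus buffers plus cached. A small sketch recomputing that row from the Mem row, assuming the column order shown above and values in KB (free's default unit); effective_usage is a hypothetical helper.

```ruby
# Recompute the "-/+ buffers/cache" row of `free` from its "Mem:" row.
# Column order of the free output above:
#   total used free shared buffers cached   (values in KB)
def effective_usage(mem_line)
  _total, used, free, _shared, buffers, cached = mem_line.split.drop(1).map(&:to_i)
  { used: used - buffers - cached, free: free + buffers + cached }
end

usage = effective_usage("Mem: 4152796 4083000 69796 0 616624 2613364")
puts "used #{usage[:used]} KB, free #{usage[:free]} KB"
# prints "used 853012 KB, free 3299784 KB", matching the -/+ buffers/cache row
```

So only about 0.8 GB of the roughly 4 GB is held by processes, and swap usage is a negligible 52 KB.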
Comments (2)
Consider using ImageScience; RMagick is known to leak massive amounts of memory and to lock up.
Chapter 6.3 of the book Deploying Rails Applications: A Step-by-Step Guide has a good section on installing and configuring the monitoring utility Monit on Linux and using it to monitor your mongrels. Monit can restart your mongrels when they fail.
Older versions of Mongrel had trouble restarting because a stale PID file was left on disk. Newer versions support the --clean option, which removes leftover PID files if they exist. So you have to upgrade Mongrel to a version that supports --clean to get around the stale PID file issue; Monit alone can't do this.
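The idea behind such an option can be sketched roughly as follows: if the PID file names a process that no longer exists, the file is left over from a crash and can be deleted before starting. This is a minimal illustration, not Mongrel's actual --clean code; remove_stale_pid_file and the example path are hypothetical.

```ruby
# Sketch of a "--clean"-style check: delete a PID file whose process is gone.
# Returns true if the file was stale and has been removed.
def remove_stale_pid_file(path)
  return false unless File.exist?(path)

  pid = File.read(path).to_i
  if pid <= 0 # empty or unparsable contents: nothing to protect
    File.delete(path)
    return true
  end

  begin
    Process.kill(0, pid) # signal 0 only checks that the process exists
    false                # still running: keep the PID file
  rescue Errno::ESRCH
    File.delete(path)    # no such process: the PID file is stale
    true
  rescue Errno::EPERM
    false                # running, but owned by another user
  end
end

# Hypothetical path; the actual pid_file setting in mongrel_cluster may differ.
# remove_stale_pid_file("log/mongrel.8001.pid")
```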