How do I monitor a Python script that gets "stuck"?
I have a data-intensive Python script that uses HTTP connections to download data. I usually run it overnight. Sometimes the connection will fail, or a website will be unavailable momentarily. I have basic error-handling that catches these exceptions and tries again periodically, exiting gracefully (and logging errors) after 5 minutes of retrying.
However, I've noticed that sometimes the job just freezes. No error is thrown, and the job is still running, sometimes hours after the last print message.
What is the best way to:
- monitor a Python script,
- detect if it is unresponsive after a given interval,
- exit it if it is unresponsive,
- and start another one?
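For context, the kind of watchdog I have in mind would look roughly like this. It is only a sketch: the file names, timings, and the idea of using the log file's modification time as a liveness signal are all my own assumptions, not something I have running.

```python
import os
import subprocess
import sys
import time

# Sketch of a watchdog: restart the worker whenever its log file has been
# quiet for too long. All names and timings here are made up.
STALL_SECONDS = 300   # consider the worker stuck after 5 silent minutes
POLL_SECONDS = 10

def run_with_watchdog(cmd, logfile, stall=STALL_SECONDS, poll=POLL_SECONDS):
    """Run cmd, killing and restarting it whenever logfile goes stale."""
    while True:
        proc = subprocess.Popen(cmd)
        while proc.poll() is None:
            time.sleep(poll)
            try:
                quiet_for = time.time() - os.path.getmtime(logfile)
            except OSError:
                continue            # log file not created yet
            if quiet_for > stall:
                proc.kill()         # unresponsive: kill it ...
                proc.wait()
                break               # ... and loop around to restart it
        else:
            return proc.returncode  # worker exited on its own: stop

if __name__ == "__main__":
    # e.g. run_with_watchdog([sys.executable, "downloader.py"], "downloader.log")
    pass
```

The worker would need to touch its log file on every successful download for the staleness check to mean anything.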
UPDATE
Thank you all for your help. As a few of you have pointed out, the urllib and socket modules don't have timeouts set properly. I am using Python 2.5 with the Freebase and urllib2 modules, and catching and handling MetawebErrors and urllib2.URLErrors. Here is a sample of err output after the last script hung for 12 hours:
File "/home/matthew/dev/projects/myapp_module/project/app/myapp/contrib/freebase/api/session.py", line 369, in _httpreq_json
resp, body = self._httpreq(*args, **kws)
File "/home/matthew/dev/projects/myapp_module/project/app/myapp/contrib/freebase/api/session.py", line 355, in _httpreq
return self._http_request(url, method, body, headers)
File "/home/matthew/dev/projects/myapp_module/project/app/myapp/contrib/freebase/api/httpclients.py", line 33, in __call__
resp = self.opener.open(req)
File "/usr/lib/python2.5/urllib2.py", line 381, in open
response = self._open(req, data)
File "/usr/lib/python2.5/urllib2.py", line 399, in _open
'_open', req)
File "/usr/lib/python2.5/urllib2.py", line 360, in _call_chain
result = func(*args)
File "/usr/lib/python2.5/urllib2.py", line 1107, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.5/urllib2.py", line 1080, in do_open
r = h.getresponse()
File "/usr/lib/python2.5/httplib.py", line 928, in getresponse
response.begin()
File "/usr/lib/python2.5/httplib.py", line 385, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.5/httplib.py", line 343, in _read_status
line = self.fp.readline()
File "/usr/lib/python2.5/socket.py", line 372, in readline
data = recv(1)
KeyboardInterrupt
You'll notice the socket error at the bottom. Since I'm using Python 2.5 and don't have access to the third urllib2.urlopen argument, is there another way to watch for and catch this error? For example, I'm catching URLErrors - is there another type of error in urllib2 or socket that I can catch which will help me?
It sounds like there is a bug in your script. The answer is not to monitor the bug, but to hunt down the bug and fix it.
We can't help you find the bug without seeing some code. But as a general idea, you might want to use logging to pinpoint where the problem is occurring, and write unit tests to help you build confidence about which parts of your code do not have the bug.
Another idea is to break your "stuck" program with Ctrl-C and to study the traceback message. It will show you what line your program was last executing.
That may give you a clue where the script is going wrong.
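If interrupting by hand is inconvenient for an overnight job, a signal handler can produce the same traceback on demand without killing the process. This is a sketch (the handler name is mine, and it is POSIX-only): sending the process `kill -USR1 <pid>` prints the currently executing stack to stderr, just like Ctrl-C would show it.

```python
import signal
import traceback

# Dump the current stack when the process receives SIGUSR1, so that
# `kill -USR1 <pid>` gives the same information as Ctrl-C without
# actually interrupting the job. POSIX only; handler name is made up.
def dump_stack(signum, frame):
    traceback.print_stack(frame)

signal.signal(signal.SIGUSR1, dump_stack)
```

On Python 3.3+ the stdlib `faulthandler` module (`faulthandler.register(signal.SIGUSR1)`) does the same thing with less code.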
Since the program is doing web communication, I'd fire up a debugging proxy like Charles http://www.charlesproxy.com/ and see if there's anything kooky happening in the back-and-forth between your script and the server.
Also consider that the socket module has no timeout set by default and can therefore hang. As of Python 2.6, however, you can pass a third argument to urllib2.urlopen (if you are using urllib2, that is) specifying a request timeout in seconds. That way the script will error out rather than go catatonic waiting for a response from a perhaps uncooperative server. If you haven't already, I'd check these things out before trying anything more elaborate.
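To see concretely what "hang" versus "error out" means here, a bare-socket demonstration (the address and timings are arbitrary): a `recv()` on a peer that never replies blocks forever without a timeout, and raises `socket.timeout` with one. `socket.timeout` is the extra exception worth catching alongside `URLError`.

```python
import socket

# A server that accepts a connection but never sends anything -- the
# same situation as the stalled recv(1) in the traceback above.
server = socket.socket()
server.bind(("127.0.0.1", 0))      # arbitrary free port
server.listen(1)

# Without a timeout this recv() would block forever; with one it raises
# socket.timeout, which retry logic can catch next to URLError.
client = socket.create_connection(server.getsockname(), timeout=0.2)
try:
    client.recv(1)
    timed_out = False
except socket.timeout:
    timed_out = True
finally:
    client.close()
    server.close()

print(timed_out)  # → True
```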
Update for Python 2.5:
To do this in Python < 2.6, you would have to set the timeout value directly in the socket module, which urllib2 uses. I haven't tried this, but it presumably works. I found this info at http://www.voidspace.org.uk/python/articles/urllib2.shtml:
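The gist of the article's recipe is a module-level default timeout, set once before urllib2 opens any connections (this is a paraphrase of the idea, not the article verbatim, and the 10-second value is arbitrary):

```python
import socket

# Set once, before urllib2 (or anything else) opens connections: every
# socket created afterwards inherits this timeout, so a silent server
# raises socket.timeout instead of blocking in recv() forever.
socket.setdefaulttimeout(10)   # seconds; value is arbitrary

# New sockets pick up the default automatically:
s = socket.socket()
print(s.gettimeout())  # → 10.0
s.close()
```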
A simple way to do what you ask is to have your current program send UDP packets to a second, watchdog program that monitors them. If the watchdog doesn't receive a packet within a certain amount of time, it kills the Python process and then starts another one.
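A self-contained sketch of that heartbeat idea (the address, payload, and time window are invented for the demo; in the real setup the send would live in the worker's main loop, and the `except` branch would kill and restart the worker process):

```python
import socket

# Hypothetical heartbeat scheme: the worker sends a tiny UDP datagram
# periodically; the monitor declares it dead if nothing arrives within
# ALIVE_WINDOW seconds.
ALIVE_WINDOW = 2.0

monitor = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
monitor.bind(("127.0.0.1", 0))     # arbitrary free port
monitor.settimeout(ALIVE_WINDOW)
addr = monitor.getsockname()

# In the real setup this send happens inside the worker's main loop.
worker = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
worker.sendto(b"alive", addr)

try:
    data, _ = monitor.recvfrom(16)
    worker_alive = (data == b"alive")
except socket.timeout:
    worker_alive = False   # no heartbeat: kill and restart the worker here

worker.close()
monitor.close()
```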
You could run your script in pdb and break in when you suspect it's frozen. It won't work on its own, but might help you figure out why it's freezing.