pyspider dashboard blocking problem
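I ran into a blocking problem once before, and that time the symptoms were obvious: shortly after the tasks started, processor2result in the dashboard climbed to 100, then the queues in front of it filled up to 100 one after another, and every spider stopped working. I found an exception in the terminal: recursion depth exceeded. It took me a long time to track down. The code was migrated over as-is and uses bs4 for parsing, and I traced the exception to one line: soup.find('title').string. I have not read the pyspider source, so I suppose this counts as a pitfall. After I added sys.setrecursionlimit(999999), the problem did not come back.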
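The post does not establish why soup.find('title').string hit the recursion limit. One plausible explanation, which is an assumption and has not been checked against the pyspider source: .string returns a bs4 NavigableString, which according to the Beautiful Soup documentation keeps a reference to the whole parse tree, so serializing it when results are passed between components could recurse through that tree. The sketch below does not use bs4 at all. It uses a hypothetical stand-in class, LinkedText, to show the general effect with the standard pickle module: a str subclass that links to its neighbours can fail to pickle, while a plain str copy of it does not.

```python
import pickle


class LinkedText(str):
    """Hypothetical stand-in for a parser string node (NOT the bs4 class).

    Like a node in a parse tree, each instance may point at the next node.
    """
    next_element = None


def build_chain(length):
    """Return the head of a chain of `length` + 1 linked text nodes."""
    head = LinkedText("title")
    node = head
    for i in range(length):
        following = LinkedText("node%d" % i)
        node.next_element = following  # stored in the instance __dict__
        node = following
    return head


def can_pickle(obj):
    """Report whether pickle can serialize `obj` without a RecursionError."""
    try:
        pickle.dumps(obj)
        return True
    except RecursionError:
        return False


title = build_chain(100000)

# Pickling the node also pickles its __dict__, which drags in the whole chain.
print("linked node picklable:", can_pickle(title))

# str() makes a plain string that carries no links to other nodes.
print("plain str picklable:", can_pickle(str(title)))
```

If this is what happened, converting the value in the callback, for example str(soup.find('title').string), would be a narrower fix than raising the recursion limit. The Python documentation also warns that setting the limit too high can crash the interpreter instead of raising an exception.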
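After I added a few more spiders, a different kind of blocking showed up two nights in a row: scheduler2fetcher 100, fetcher2processor 100, processor2result 1. Only restarting pyspider clears it. This time I could not find any exception, only some warnings and errors: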
[E 170616 00:11:40 tornado_fetcher:212] [599] spider1:44ee8851b5bd4d49df658f37c7f60aaa http://zdb.pedaily.cn/enterprise/show4521/, HTTP 599: Operation timed out after 120001 milliseconds with 0 bytes received 120.00s
[E 170616 00:11:40 processor:202] process spider1:44ee8851b5bd4d49df658f37c7f60aaa http://zdb.pedaily.cn/enterprise/show4521/ -> [599] len:0 -> result:None fol:0 msg:0 err:Exception('no html',)
[E 170616 00:11:45 tornado_fetcher:212] [599] spider1:7f92e4cb28099cd9cb7e2de2190e0953 http://zdb.pedaily.cn/enterprise/show4515/, HTTP 599: Operation timed out after 120000 milliseconds with 0 bytes received 120.00s
[E 170616 00:11:45 processor:202] process spider1:7f92e4cb28099cd9cb7e2de2190e0953 http://zdb.pedaily.cn/enterprise/show4515/ -> [599] len:0 -> result:None fol:0 msg:0 err:Exception('no html',)
[E 170616 00:11:50 tornado_fetcher:212] [599] spider1:9a8ffbd55f823370d3d4a1902fab678e http://zdb.pedaily.cn/enterprise/show4516/, HTTP 599: Operation timed out after 120000 milliseconds with 0 bytes received 120.00s
[E 170616 00:11:50 processor:202] process spider1:9a8ffbd55f823370d3d4a1902fab678e http://zdb.pedaily.cn/enterprise/show4516/ -> [599] len:0 -> result:None fol:0 msg:0 err:Exception('no html',)
[E 170616 00:12:50 tornado_fetcher:212] [599] spider2:92ec929938cb51c2dc8d0c20cbfb0361 http://www.cyzone.cn/r/20170808/4502.html, HTTP 599: Operation timed out after 120001 milliseconds with 0 bytes received 120.00s
[E 170616 00:12:50 processor:202] process spider2:92ec929938cb51c2dc8d0c20cbfb0361 http://www.cyzone.cn/r/20170808/4502.html -> [599] len:0 -> result:None fol:0 msg:0 err:Exception('no html',)
[E 170616 00:13:04 tornado_fetcher:212] [599] spider2:479ced01278e5993ac5ed39912caf5e2 http://www.cyzone.cn/r/20170808/4498.html, HTTP 599: Operation timed out after 120000 milliseconds with 0 bytes received 120.00s
[E 170616 00:13:04 processor:202] process spider2:479ced01278e5993ac5ed39912caf5e2 http://www.cyzone.cn/r/20170808/4498.html -> [599] len:0 -> result:None fol:0 msg:0 err:Exception('no html',)
[W 170616 00:18:27 tornado_fetcher:423] [502] spider1:a3e77851e2523f442318335a11f5b3a8 http://zdb.pedaily.cn/enterprise/show4499/ 1.02s
[W 170616 00:18:37 tornado_fetcher:423] [502] spider2:8b6ee4f75a3821a92481d940378acfdd http://www.cyzone.cn/r/20170808/4363.html 5.04s
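To see which project and status code dominate a long log, the fetcher lines can be tallied with a short script. This is a minimal sketch: the regular expression is inferred only from the log lines quoted above and is an assumption about the format, and tally is a hypothetical helper, not part of pyspider.

```python
import re
from collections import Counter

# Pattern inferred from the tornado_fetcher lines quoted in this post (assumption).
FETCHER_LINE = re.compile(
    r"^\[(?P<level>[A-Z]) (?P<date>\d{6}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"tornado_fetcher:\d+\] \[(?P<status>\d{3})\] "
    r"(?P<project>[^:\s]+):(?P<taskid>[0-9a-f]+) "
)


def tally(lines):
    """Count fetcher log lines per (project, HTTP status) pair."""
    counts = Counter()
    for line in lines:
        match = FETCHER_LINE.match(line)
        if match:  # processor and other lines are skipped
            counts[(match.group("project"), int(match.group("status")))] += 1
    return counts


sample = [
    "[E 170616 00:11:40 tornado_fetcher:212] [599] "
    "spider1:44ee8851b5bd4d49df658f37c7f60aaa "
    "http://zdb.pedaily.cn/enterprise/show4521/, HTTP 599: Operation timed out "
    "after 120001 milliseconds with 0 bytes received 120.00s",
    "[E 170616 00:11:40 processor:202] process "
    "spider1:44ee8851b5bd4d49df658f37c7f60aaa "
    "http://zdb.pedaily.cn/enterprise/show4521/ -> [599] len:0 -> result:None "
    "fol:0 msg:0 err:Exception('no html',)",
    "[W 170616 00:18:27 tornado_fetcher:423] [502] "
    "spider1:a3e77851e2523f442318335a11f5b3a8 "
    "http://zdb.pedaily.cn/enterprise/show4499/ 1.02s",
]

for (project, status), count in sorted(tally(sample).items()):
    print(project, status, count)
```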
All the spiders follow the same logic: generate links, hand them to pyspider, and collect the data. Below is the detail_page callback:
    @config(priority=2)
    @catch_status_code_error
    @config(age=14 * 24 * 60 * 60)
    def detail_page(self, response):
        if response.status_code in [301, 302, 404]:
            return
        html = response.text
        if not html or response.status_code != 200:
            raise Exception('no html')
        try:
            # HTML handling function, parses with bs4
            ret_dic = Parse(html)
        except Exception as e:
            raise e
        return ret_dic
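From what I can see, this error log ends around midnight, and after that there were no more errors or debug output. on_start is also triggered once every 24 hours, so my guess is that on_start kicks in at midnight, generates a large number of crawl tasks, and something unexpected then causes the blockage??
result_worker logged nothing after that either. scheduler and tornado_fetcher kept printing logs for a while and then stopped as well?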
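The queue counters can hint at which stage stopped consuming. The sketch below is a toy model and not pyspider code: the queue names and the limit of 100 come from the dashboard values in this post, the stage names come from the log lines, and the tick-based loop with one new task per tick is purely an assumption made for illustration.

```python
from collections import deque

QUEUE_SIZE = 100  # the dashboard counters in this post top out at 100

QUEUES = ["scheduler2fetcher", "fetcher2processor", "processor2result"]
# CONSUMERS[i] is the stage that reads from QUEUES[i].
CONSUMERS = ["fetcher", "processor", "result_worker"]


def simulate(stalled_stage, ticks=1000):
    """Run the toy pipeline and return the final length of each queue.

    `stalled_stage` names the one stage that never consumes anything.
    """
    queues = [deque() for _ in QUEUES]
    for _ in range(ticks):
        # Walk from the last queue to the first so a task moves one hop per tick.
        for i in reversed(range(len(queues))):
            if CONSUMERS[i] == stalled_stage or not queues[i]:
                continue
            if i + 1 < len(queues):
                if len(queues[i + 1]) >= QUEUE_SIZE:
                    continue  # downstream queue is full, so this stage blocks
                queues[i + 1].append(queues[i].popleft())
            else:
                queues[i].popleft()  # result_worker finishes the task
        if len(queues[0]) < QUEUE_SIZE:
            queues[0].append("task")  # scheduler emits one new task
    return {name: len(q) for name, q in zip(QUEUES, queues)}


print(simulate("processor"))
print(simulate("result_worker"))
```

In this model, stalling the processor leaves the first two queues full and the last one empty, which is close to the second pattern described above, while stalling result_worker fills all three queues, which resembles the first incident. This only narrows down where to look. It does not explain why a stage stopped.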