文章来源于网络收集而来,版权归原创者所有,如有侵权请及时联系!
8.4 信号
信号提供了一种为系统中发生的事件添加回调的机制,比如当爬虫开启时或当item被抓取时。可以使用crawler.signals.connect()方法hook到它们上(下一节将会给出使用它的一个示例)。信号总共有11个,理解它们的最简单方式可能就是在实践中看到它们。我创建了一个项目,在其中创建了一个扩展,hook了所有可以使用的信号。另外,我还创建了一个Item管道、一个下载器和一个爬虫中间件,可以记录所有的方法调用。该项目使用的爬虫非常简单,只对两个item进行yield操作,然后抛出异常。
def parse(self, response): for i in range(2): item = HooksasyncItem() item['name'] = "Hello %d" % i yield item raise Exception("dead")
在第二个item中,我配置了Item管道,以抛出DropItem异常。
本示例的完整代码可以从ch08/hooksasync得到。
使用该项目,我们可以更好地理解某个信号是什么时候发出的。请查看如下执行中日志行之间的注释(为了简短起见,省略了部分行)。
$ scrapy crawl test ... many lines ... # First we get those two signals... INFO: Extension, signals.spider_opened fired INFO: Extension, signals.engine_started fired # Then for each URL we get a request_scheduled signal INFO: Extension, signals.request_scheduled fired ...# when download completes we get response_downloaded INFO: Extension, signals.response_downloaded fired INFO: DownloaderMiddlewareprocess_response called for example.com # Work between response_downloaded and response_received INFO: Extension, signals.response_received fired INFO: SpiderMiddlewareprocess_spider_input called for example.com # here our parse() method gets called... and then SpiderMiddleware used INFO: SpiderMiddlewareprocess_spider_output called for example.com # For every Item that goes through pipelines successfully... INFO: Extension, signals.item_scraped fired # For every Item that gets dropped using the DropItem exception... INFO: Extension, signals.item_dropped fired # If your spider throws something else... INFO: Extension, signals.spider_error fired # ... the above process repeats for each URL # ... till we run out of them. then... INFO: Extension, signals.spider_idle fired # by hooking spider_idle you can schedule further Requests. If you don't # the spider closes. INFO: Closing spider (finished) INFO: Extension, signals.spider_closed fired # ... stats get printed # and finally engine gets stopped. INFO: Extension, signals.engine_stopped fired
虽然只有11个信号,可能会感觉比较有限,但是每个Scrapy的默认中间件都是只使用它们实现的,因此它们肯定够用。请注意,除了spider_idle、spider_error、request_scheduled、response_received和response_downloaded以外的所有其他信号,都可以返回延迟,而不是真实值。
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论