Replaying a Scrapy spider on stored data
I have started using Scrapy to scrape a few websites. If I later add a new field to my model or change my parsing functions, I'd like to be able to "replay" the downloaded raw data offline to scrape it again. It looks like Scrapy had the ability to store raw data in a replay file at one point:
http://dev.scrapy.org/browser/scrapy/trunk/scrapy/command/commands/replay.py?rev=168
But this functionality seems to have been removed in the current version of Scrapy. Is there another way to achieve this?
2 Answers
If you run crawl --record=[cache.file] [scraper], you'll then be able to use replay [scraper].

Alternatively, you can cache all responses with the HttpCacheMiddleware by including it in DOWNLOADER_MIDDLEWARES. If you do this, every time you run the scraper, it will check the file system first.
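A minimal sketch of the cache setup, assuming a recent Scrapy version (the HTTPCACHE_* names are real Scrapy settings; the directory name and expiration value are illustrative choices, and the dotted path of the storage class has moved between Scrapy versions):

# settings.py -- enable the HTTP cache so repeated runs replay from disk
HTTPCACHE_ENABLED = True       # turn the cache middleware on
HTTPCACHE_DIR = 'httpcache'    # stored under the project's .scrapy directory
HTTPCACHE_EXPIRATION_SECS = 0  # 0 means cached responses never expire
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

With the expiration set to 0, cached pages never go stale, so after adding a model field or changing a parse function you can re-run the spider entirely offline against the stored responses.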
You can enable HTTPCACHE_ENABLED as described at http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html?highlight=FilesystemCacheStorage#httpcache-enabled to cache all HTTP requests and responses and make crawling resumable.

Or try Jobs to pause and resume crawling:
http://scrapy.readthedocs.org/en/latest/topics/jobs.html
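For the Jobs approach, the documented pattern is to pass a JOBDIR setting on the command line; the spider name and directory below are placeholders:

# start a crawl with a persistent job directory
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
# stop it safely (Ctrl-C or a signal, sent once), then resume later
# by issuing the exact same command
scrapy crawl somespider -s JOBDIR=crawls/somespider-1

Note that Jobs persist the request queue and dedupe state so an interrupted crawl can continue; unlike the HTTP cache, they don't store response bodies for later re-parsing.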