Replaying a Scrapy spider on stored data

Posted 2024-12-10 11:13:59

I have started using Scrapy to scrape a few websites. If I later add a new field to my model or change my parsing functions, I'd like to be able to "replay" the downloaded raw data offline to scrape it again. It looks like Scrapy had the ability to store raw data in a replay file at one point:

http://dev.scrapy.org/browser/scrapy/trunk/scrapy/command/commands/replay.py?rev=168

But this functionality seems to have been removed in the current version of Scrapy. Is there another way to achieve this?

Comments (2)

站稳脚跟 2024-12-17 11:13:59

If you run crawl --record=[cache.file] [scraper], you'll then be able to use replay [scraper].

Alternatively, you can cache all responses with the HttpCacheMiddleware by including it in DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 300,
}

If you do this, every time you run the scraper, it will check the file system first.
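
For what it's worth, in more recent Scrapy releases the HTTP cache is normally switched on through the HTTPCACHE_* settings rather than by adding the middleware yourself (it is already registered by default, under the newer scrapy.downloadermiddlewares.httpcache path). A minimal settings.py sketch; the directory name and expiration value below are illustrative choices, not part of the original answer:

# settings.py -- rough sketch of the file-system HTTP cache in newer Scrapy versions
HTTPCACHE_ENABLED = True           # turn the cache on
HTTPCACHE_DIR = 'httpcache'        # stored under the project's .scrapy/ directory (illustrative)
HTTPCACHE_EXPIRATION_SECS = 0      # 0 = cached responses never expire
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

With the default Dummy cache policy, any request that already has a cached response is served from disk instead of the network, so re-running the spider effectively replays the stored responses through your updated parse callbacks.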

倾听心声的旋律 2024-12-17 11:13:59

You can enable HTTPCACHE_ENABLED, as described at http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html?highlight=FilesystemCacheStorage#httpcache-enabled, to cache all HTTP requests and responses and effectively resume crawling from the cached data.

Or, try Jobs to pause and resume crawling:
http://scrapy.readthedocs.org/en/latest/topics/jobs.html
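
As a quick sketch of the Jobs approach (the spider name and directory below are placeholders, not from the original answer): persisting a JOBDIR keeps the scheduler queue and dupefilter state on disk, so a crawl can be stopped and resumed later.

# settings.py -- persist crawl state so the job can be paused and resumed
JOBDIR = 'crawls/somespider-1'     # placeholder directory; use one directory per job

# Equivalently, set it per run from the command line (see the Jobs page linked above):
#   scrapy crawl somespider -s JOBDIR=crawls/somespider-1

Note that this resumes an interrupted crawl; unlike the HTTP cache, it does not by itself let you re-parse responses that were already downloaded.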
