How can I stop a Scrapy CrawlSpider and later resume where it left off?

Posted on 2024-12-03 02:13:50

I have a Scrapy CrawlSpider with a very large list of URLs to crawl. I would like to be able to stop it, saving the current state, and resume it later without having to start over. Is there a way to accomplish this within the Scrapy framework?

Comments (3)

又怨 2024-12-10 02:13:50

Just wanted to share that this feature is included in the latest Scrapy version, but the parameter name has changed. You should use it like this:

 scrapy crawl thespider --set JOBDIR=run1

For more information, see http://doc.scrapy.org/en/latest/topics/jobs.html#job-directory
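If you prefer to keep the setting in code rather than passing it on the command line, here is a minimal sketch assuming a reasonably modern Scrapy; the spider name, start URL, and parse logic are placeholders, not taken from this thread:

    import scrapy

    class TheSpider(scrapy.Spider):
        name = "thespider"
        start_urls = ["http://example.com/"]  # placeholder start URL

        # JOBDIR is where Scrapy persists the scheduler queue and the
        # dupefilter state, so an interrupted crawl can be resumed.
        custom_settings = {"JOBDIR": "run1"}

        def parse(self, response):
            yield {"url": response.url}

Stopping the crawl with a single Ctrl-C triggers a graceful shutdown that saves this state; running the spider again with the same JOBDIR resumes where it left off.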

北斗星光 2024-12-10 02:13:50

There was a question on the mailing list just a few months ago: http://groups.google.com/group/scrapy-users/browse_thread/thread/6a8df07daff723fc?pli=1

Quoting Pablo:

We're not only considering it, but also working on it. There are
currently two working patches in my MQ that add this functionality in
case anyone wants to try an early preview (they need to be applied in
order):
http://hg.scrapy.org/users/pablo/mq/file/tip/scheduler_single_spider....
http://hg.scrapy.org/users/pablo/mq/file/tip/persistent_scheduler.patch
To run a spider as before (no persistence):

scrapy crawl thespider 

To run a spider storing scheduler+dupefilter state in a dir:

scrapy crawl thespider --set SCHEDULER_DIR=run1 

During the crawl, you can hit ^C to cancel the crawl and resume it
later with:

scrapy crawl thespider --set SCHEDULER_DIR=run1 

The SCHEDULER_DIR setting name is bound to change before the final
release, but the idea will be the same - that you pass a directory
where to persist the state.
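The feature that eventually shipped (JOBDIR) also lets a spider carry its own state across a stop/resume cycle via the self.state dict, which Scrapy's SpiderState extension pickles into the job directory on shutdown and restores on resume. A minimal sketch, with an illustrative spider name and a placeholder URL:

    import scrapy

    class PersistentSpider(scrapy.Spider):
        name = "persistentspider"  # illustrative name
        start_urls = ["http://example.com/"]  # placeholder start URL

        def parse(self, response):
            # self.state is an ordinary dict; when the crawl runs with a
            # JOBDIR, it survives ^C and resume, so this counter keeps
            # counting across runs.
            self.state["pages_seen"] = self.state.get("pages_seen", 0) + 1
            yield {"url": response.url, "pages_seen": self.state["pages_seen"]}

Run it with scrapy crawl persistentspider -s JOBDIR=crawls/run1, interrupt with ^C, and run the same command again to continue.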

逆蝶 2024-12-10 02:13:50

Scrapy now has a working feature for this, documented on its site (the jobs page linked in the first answer above).

Here's the actual command:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1
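One caveat from Scrapy's jobs documentation (not from this thread): persistence serializes pending requests to disk with pickle, so request callbacks must be picklable; in practice that means using spider methods rather than lambdas. A minimal sketch assuming a modern Scrapy, with placeholder names:

    import scrapy

    class SomeSpider(scrapy.Spider):
        name = "somespider"
        start_urls = ["http://example.com/"]  # placeholder start URL

        def parse(self, response):
            for href in response.css("a::attr(href)").getall():
                # A lambda callback could not be pickled into the JOBDIR
                # disk queue; a bound method keeps the request serializable.
                yield response.follow(href, callback=self.parse_page)

        def parse_page(self, response):
            yield {"url": response.url}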