How can I stop a Scrapy CrawlSpider and later resume where it left off?

Posted on 2024-12-03 02:13:50

I have a Scrapy CrawlSpider with a very large list of URLs to crawl. I would like to be able to stop it, saving the current state, and resume it later without having to start over. Is there a way to accomplish this within the Scrapy framework?

Comments (3)

又怨 2024-12-10 02:13:50

Just wanted to share that this feature is included in the latest Scrapy version, but the parameter name has changed. You should use it like this:

 scrapy crawl thespider --set JOBDIR=run1

For more information, see http://doc.scrapy.org/en/latest/topics/jobs.html#job-directory
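If you prefer to keep the setting in code rather than passing it on the command line, here is a minimal sketch assuming a reasonably modern Scrapy; the spider name, start URL, and parse logic are placeholders, not taken from this thread:

    import scrapy

    class TheSpider(scrapy.Spider):
        name = "thespider"
        start_urls = ["http://example.com/"]  # placeholder start URL

        # JOBDIR is where Scrapy persists the scheduler queue and the
        # dupefilter state, so an interrupted crawl can be resumed.
        custom_settings = {"JOBDIR": "run1"}

        def parse(self, response):
            yield {"url": response.url}

Stopping the crawl with a single Ctrl-C triggers a graceful shutdown that saves this state; running the spider again with the same JOBDIR resumes where it left off.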

北斗星光 2024-12-10 02:13:50

There was a question on the mailing list just a few months ago: http://groups.google.com/group/scrapy-users/browse_thread/thread/6a8df07daff723fc?pli=1

Quoting Pablo:

We're not only considering it, but also working on it. There are
currently two working patches in my MQ that add this functionality in
case anyone wants to try an early preview (they need to be applied in
order):
http://hg.scrapy.org/users/pablo/mq/file/tip/scheduler_single_spider....
http://hg.scrapy.org/users/pablo/mq/file/tip/persistent_scheduler.patch
To run a spider as before (no persistence):

scrapy crawl thespider 

To run a spider storing scheduler+dupefilter state in a dir:

scrapy crawl thespider --set SCHEDULER_DIR=run1 

During the crawl, you can hit ^C to cancel the crawl and resume it
later with:

scrapy crawl thespider --set SCHEDULER_DIR=run1 

The SCHEDULER_DIR setting name is bound to change before the final
release, but the idea will be the same - that you pass a directory
where to persist the state.
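The feature that eventually shipped (JOBDIR) also lets a spider carry its own state across a stop/resume cycle via the self.state dict, which Scrapy's SpiderState extension pickles into the job directory on shutdown and restores on resume. A minimal sketch, with an illustrative spider name and a placeholder URL:

    import scrapy

    class PersistentSpider(scrapy.Spider):
        name = "persistentspider"  # illustrative name
        start_urls = ["http://example.com/"]  # placeholder start URL

        def parse(self, response):
            # self.state is an ordinary dict; when the crawl runs with a
            # JOBDIR, it survives ^C and resume, so this counter keeps
            # counting across runs.
            self.state["pages_seen"] = self.state.get("pages_seen", 0) + 1
            yield {"url": response.url, "pages_seen": self.state["pages_seen"]}

Run it with scrapy crawl persistentspider -s JOBDIR=crawls/run1, interrupt with ^C, and run the same command again to continue.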

逆蝶 2024-12-10 02:13:50

Scrapy now has a working feature for this, documented on its site (the jobs page linked in the first answer above).

Here's the actual command:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1
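One caveat from Scrapy's jobs documentation (not from this thread): persistence serializes pending requests to disk with pickle, so request callbacks must be picklable; in practice that means using spider methods rather than lambdas. A minimal sketch assuming a modern Scrapy, with placeholder names:

    import scrapy

    class SomeSpider(scrapy.Spider):
        name = "somespider"
        start_urls = ["http://example.com/"]  # placeholder start URL

        def parse(self, response):
            for href in response.css("a::attr(href)").getall():
                # A lambda callback could not be pickled into the JOBDIR
                # disk queue; a bound method keeps the request serializable.
                yield response.follow(href, callback=self.parse_page)

        def parse_page(self, response):
            yield {"url": response.url}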