Programmatically run a Scrapy spider multiple times, without multiple processes and without running them simultaneously
I have a Scrapy spider that scrapes the content of a webpage, and the items it yields depend on an argument passed to the spider:
scrapy runspider myspider -a ID=1
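For context, a minimal sketch of what such a spider can look like, with the ID wired in through the spider's constructor (the URL scheme and parsed fields here are placeholders, not the real ones):

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, ID=None, *args, **kwargs):
        # "-a ID=1" on the command line arrives here as a keyword argument
        super().__init__(*args, **kwargs)
        self.ID = ID
        # hypothetical URL scheme, just to show the argument driving the crawl
        self.start_urls = [f"https://example.com/pages/{ID}"]

    def parse(self, response):
        yield {"id": self.ID, "title": response.css("title::text").get()}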
I've been trying to create a sort of "spider composer" using CrawlerProcess and CrawlerRunner, but the problem is that I can't run a spider multiple times sequentially, because the reactor can't be restarted (twisted.internet.error.ReactorNotRestartable). The only way I've managed to do this is to run multiple instances of the same spider simultaneously, but that gives me another error: twisted.internet.error.CannotListenError: Couldn't listen on 127.0.0.1:6073: [Errno 98] Address already in use.
Which makes sense, since I'm trying to scrape too many webpages simultaneously (each crawler also binds a telnet console port from Scrapy's default 6023-6073 range, which runs out when enough instances are up).
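Concretely, the simultaneous variant looks roughly like this: every crawl is queued into one CrawlerProcess before a single start() call (a sketch, stripped of the real settings):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
for id in ids:
    process.crawl(MySpider, ID=id)  # queue every crawl up front
process.start()  # all queued crawlers then run concurrently on one reactor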
Basically, what I am trying to do is:
for id in ids:
    process.crawl(MySpider, ID=id)
    process.start()
I've seen people suggest using multiple processes for this kind of thing, but since I'm scraping up to a couple of million webpages in total, that isn't really sustainable, and I also don't want to unintentionally DDoS their servers.
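For reference, Scrapy's "common practices" documentation sketches a way to chain crawls sequentially on a single reactor using CrawlerRunner and Twisted deferreds, which avoids restarting the reactor altogether; adapted to the case above, it might look like this (a sketch, untested at this scale):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_sequentially(ids):
    # each yield waits for the previous crawl to finish before the next starts,
    # so the reactor is started exactly once and never restarted
    for id in ids:
        yield runner.crawl(MySpider, ID=id)
    reactor.stop()

crawl_sequentially(ids)
reactor.run()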