Running a Scrapy spider multiple times programmatically, without multiple processes or simultaneous runs

Posted on 2025-01-09 16:09:34

I have a Scrapy spider that scrapes the content of a webpage, and the item it scrapes depends on an argument passed to the spider: scrapy runspider myspider -a ID=1

I've been trying to create a sort of "spider composer" using CrawlerProcess and CrawlerRunner, but I can't run a spider multiple times sequentially because the reactor can't be restarted (twisted.internet.error.ReactorNotRestartable). The only way I've managed to do this is to run multiple instances of the same spider simultaneously, but that gives me another error: twisted.internet.error.CannotListenError: Couldn't listen on 127.0.0.1:6073: [Errno 98] Address already in use. That makes sense, since I'm trying to scrape too many webpages simultaneously.

Basically what I am trying to do:

for id in ids:
    process.crawl(MySpider, ID=id)
    process.start()
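
One way the loop above might be restructured, offered as a minimal sketch rather than a confirmed fix: start the reactor once and chain the crawls, so each one begins only after the Deferred returned by the previous runner.crawl() has fired. The helper names crawl_sequentially and main, the import path myproject.spiders.myspider, and the ID list are hypothetical and not from the original post.

```python
def crawl_sequentially(runner, spider_cls, ids, on_finished):
    """Run one crawl per ID, starting each only after the previous one ends.

    `runner` is expected to behave like scrapy.crawler.CrawlerRunner:
    runner.crawl(...) returns a Deferred that fires when the crawl finishes.
    """
    pending = iter(ids)

    def start_next(_result=None):
        try:
            next_id = next(pending)
        except StopIteration:
            on_finished()  # nothing left to crawl
            return
        deferred = runner.crawl(spider_cls, ID=next_id)
        # addBoth: continue with the next ID whether the crawl succeeded or
        # failed. Note that this discards the failure; log it here if needed.
        deferred.addBoth(start_next)

    start_next()


def main():
    # Imports live here so the helper above has no hard dependency on Scrapy.
    # If the project sets TWISTED_REACTOR, that reactor may need to be
    # installed first (e.g. with scrapy.utils.reactor.install_reactor).
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    from myproject.spiders.myspider import MySpider  # hypothetical path

    ids = [1, 2, 3]  # placeholder IDs
    runner = CrawlerRunner(get_project_settings())
    # Schedule the chain to start once the reactor is running; the reactor
    # is started exactly once and stopped after the last crawl.
    reactor.callWhenRunning(crawl_sequentially, runner, MySpider, ids, reactor.stop)
    reactor.run()
```

Calling main() from a script would run the crawls one after another inside a single process. Because reactor.run() is called only once, ReactorNotRestartable should not come up, and only one crawler is active at a time.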

I've seen people suggest using multiple processes for this kind of thing, but since I'm scraping up to a couple of million webpages in total, that isn't really sustainable, and I also don't want to unintentionally DDoS their servers.
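
On the concern about overloading the target servers, Scrapy's built-in throttling settings may help. The sketch below uses real Scrapy setting names, but the values and the name POLITE_SETTINGS are illustrative assumptions, not recommendations from the original post.

```python
# Illustrative values only; tune them for the site being crawled.
POLITE_SETTINGS = {
    # At most one in-flight request per domain.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    # Wait (in seconds) between consecutive requests to the same site.
    "DOWNLOAD_DELAY": 1.0,
    # Let Scrapy adapt the delay to the server's observed latency.
    "AUTOTHROTTLE_ENABLED": True,
    # The telnet console binds a local port per crawler (default range
    # 6023-6073); disabling it avoids port conflicts if several crawlers
    # ever run at the same time.
    "TELNETCONSOLE_ENABLED": False,
}
```

A dict like this can be passed to CrawlerRunner or CrawlerProcess as the settings argument, or set on the spider as custom_settings.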
