Crawling multiple start URLs with different depth limits

Posted 2024-12-11 14:02:34


I'm trying to get Scrapy 0.12 to change its "maximum depth" setting for different URLs in the spider's start_urls variable.

If I understand the documentation correctly, there's no way to do this, because the DEPTH_LIMIT setting is global for the entire framework and there's no notion of "requests originating from a given initial request".

Is there a way to circumvent this? Is it possible to run multiple instances of the same spider, each initialized with one start URL and its own depth limit?
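
For reference, this is the global behavior the question describes: DEPTH_LIMIT lives in the project settings and applies to the whole crawl. A minimal sketch (the value 10 is arbitrary):

# settings.py
# DEPTH_LIMIT applies to the entire crawl: every request chain,
# whichever start URL it grew from, is cut off at this depth.
DEPTH_LIMIT = 10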


Comments (1)

孤城病女 2024-12-18 14:02:35


Sorry, it looks like I didn't understand your question correctly from the beginning. Correcting my answer:

Responses carry a depth key in their meta dict. You can check it and take appropriate action.

from scrapy.spider import BaseSpider
from scrapy.http import Request


class MySpider(BaseSpider):

    def make_requests_from_url(self, url):
        # Tag every seed request with its start URL so descendants can inherit it.
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse(self, response):
        # '???' and other_url are placeholders; meta['depth'] is set by DepthMiddleware.
        if response.meta['start_url'] == '???' and response.meta['depth'] > 10:
            pass  # do something here for exceeding the limit for this start URL
        else:
            # find links and yield requests for them, passing the start URL along
            yield Request(other_url, meta={'start_url': response.meta['start_url']})

http://doc.scrapy.org/en/0.12/topics/spiders.html#scrapy.spider.BaseSpider.make_requests_from_url
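
Building on the answer above, here is a fuller sketch of what a per-start-URL depth policy might look like against the Scrapy 0.12 API; the spider name, the example URLs, the depth_limits table, and the link-extraction details are illustrative assumptions, not part of the original answer:

import urlparse

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector


class DepthPerUrlSpider(BaseSpider):
    name = 'depth_per_url'  # hypothetical spider name
    start_urls = ['http://example.com/shallow', 'http://example.com/deep']

    # Hypothetical per-start-URL limits. The global DEPTH_LIMIT should be
    # left at 0 (unlimited) or set no lower than the largest value here.
    depth_limits = {
        'http://example.com/shallow': 2,
        'http://example.com/deep': 10,
    }

    def make_requests_from_url(self, url):
        # Tag each seed with its own start URL, exactly as in the answer above.
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse(self, response):
        start_url = response.meta['start_url']
        # DepthMiddleware records the depth in meta; treat a missing key as 0 (seed).
        depth = response.meta.get('depth', 0)
        if depth >= self.depth_limits.get(start_url, 0):
            return  # limit reached for this start URL: stop following links
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            # Propagate the start URL so the check above keeps working downstream.
            yield Request(urlparse.urljoin(response.url, href),
                          meta={'start_url': start_url})

The key point is that DEPTH_LIMIT itself stays global; the per-URL cutoff happens inside parse, which simply stops yielding new requests once the inherited start URL's limit is reached.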
