400 error only when crawling, not when using the Scrapy shell

Posted on 2025-02-13 00:10:12


I am receiving a 400 "HTTP status code is not handled or not allowed" error when using the Scrapy crawl command to scrape BBC News article URLs from https://www.bbc.com/news/topics/c3np65e0jq4t. I am using the code and command below to initiate the scraping.

import scrapy


class bbc_url_spider(scrapy.Spider):
    name = 'bbc_url_spider'
    start_urls = ['https://www.bbc.co.uk/news/topics/c3np65e0jq4ts']
    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"

    def parse(self, response):
        # Yield each unique article link found on the topic page
        for url in set(response.css('a.ssrcss-1j8v9o5-PromoLink.e1f5wbog0::attr(href)').getall()):
            yield {
                'url': url
            }

        # Follow the "next page" button, if present
        next_button = response.xpath('.//div[contains(@class,"e1b2sq420")]')[-1]
        next_page_link = next_button.css('a::attr(href)').get()

        if next_page_link is not None:
            yield response.follow('https://www.bbc.co.uk/news/topics/c3np65e0jq4t' + next_page_link, callback=self.parse)

scrapy crawl bbc_url_spider -O bbc_urls.json

Which returns this log:

[log screenshot]

However, when using the Scrapy shell, I am able to access the exact same webpage with a simple fetch:

[shell screenshot]

I am not sure why this is happening. I have tried using different user agents and middleware, but nothing seems to work. Any advice would be appreciated.
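For reference, one of the variants I tried moved the user agent into custom_settings instead of the class attribute (a rough sketch; the exact user-agent strings and middleware settings varied):

    # One attempted variant (illustrative values): override the user agent
    # through the spider's custom_settings rather than the class attribute
    custom_settings = {
        'USER_AGENT': (
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
            '(KHTML, like Gecko) Chrome/120.0 Safari/537.36'
        ),
    }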


Comments (1)

送舟行 2025-02-20 00:10:12


Status codes in the 400 and 500 ranges are errors, so Scrapy ignores them by design.

If you have a specific case where you still want the callback method (such as parse) to be called for these status codes, you can do so by adding this to your spider class:

    handle_httpstatus_list = [400, 404]  # note it's a list
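With that setting in place, Scrapy passes the 400 and 404 responses through to your callback, so the callback should guard on response.status before extracting anything. A minimal sketch (the logging is illustrative):

    def parse(self, response):
        # Non-2xx statuses listed in handle_httpstatus_list reach this
        # callback, so check the status before trying to extract data
        if response.status != 200:
            self.logger.warning('Got %s for %s', response.status, response.url)
            return
        # ... normal extraction logic ...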

In most cases, 400 will be an error. If you want, you can use an errback to handle these errors; see the Scrapy docs for details.
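For the errback route, here is a minimal sketch along the lines of the errback example in the Scrapy docs (the spider name and logging are illustrative):

    import scrapy
    from scrapy.spidermiddlewares.httperror import HttpError

    class ErrbackExampleSpider(scrapy.Spider):
        name = 'errback_example'

        def start_requests(self):
            # Attach an errback so failed requests (including HTTP 400/500
            # responses) are routed to handle_error instead of being dropped
            yield scrapy.Request(
                'https://www.bbc.co.uk/news/topics/c3np65e0jq4t',
                callback=self.parse,
                errback=self.handle_error,
            )

        def parse(self, response):
            self.logger.info('Successfully fetched %s', response.url)

        def handle_error(self, failure):
            # The HttpError middleware raises HttpError for non-2xx responses
            if failure.check(HttpError):
                response = failure.value.response
                self.logger.error('HttpError %s on %s', response.status, response.url)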
