400 error only when crawling, not when using the Scrapy shell

Posted on 2025-02-13 00:10:12


I am receiving a 400 "HTTP status code is not handled or not allowed" error when using the Scrapy crawl command to scrape BBC News article URLs from https://www.bbc.com/news/topics/c3np65e0jq4t. I am using the code and command below to initiate the scraping.

import scrapy


class bbc_url_spider(scrapy.Spider):
    name = 'bbc_url_spider'
    start_urls = ['https://www.bbc.co.uk/news/topics/c3np65e0jq4ts']
    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"

    def parse(self, response):
        # Yield each unique article link found on the topic page
        for url in set(response.css('a.ssrcss-1j8v9o5-PromoLink.e1f5wbog0::attr(href)').getall()):
            yield {
                'url': url
            }

        # Follow the "next page" button, if present
        next_button = response.xpath('.//div[contains(@class,"e1b2sq420")]')[-1]
        next_page_link = next_button.css('a::attr(href)').get()

        if next_page_link is not None:
            yield response.follow('https://www.bbc.co.uk/news/topics/c3np65e0jq4t' + next_page_link, callback=self.parse)

scrapy crawl bbc_url_spider -O bbc_urls.json

Which returns this log:

[log screenshot]

However, when using the Scrapy shell, I am able to access the exact same webpage with a simple fetch:

[shell screenshot]

I am not sure why this is happening. I have tried using different user agents and middleware, but nothing seems to work. Any advice would be appreciated.
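For reference, one of the variants I tried moved the user agent into custom_settings instead of the class attribute (a rough sketch; the exact user-agent strings and middleware settings varied):

    # One attempted variant (illustrative values): override the user agent
    # through the spider's custom_settings rather than the class attribute
    custom_settings = {
        'USER_AGENT': (
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
            '(KHTML, like Gecko) Chrome/120.0 Safari/537.36'
        ),
    }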


Comments (1)

送舟行 2025-02-20 00:10:12


Status codes in the 400 and 500 ranges are errors, so Scrapy ignores them by design.

If you have a specific case where you still want the callback method (such as parse) to be called for these status codes, you can do so by adding this to your spider class:

    handle_httpstatus_list = [400, 404]  # note it's a list
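With that setting in place, Scrapy passes the 400 and 404 responses through to your callback, so the callback should guard on response.status before extracting anything. A minimal sketch (the logging is illustrative):

    def parse(self, response):
        # Non-2xx statuses listed in handle_httpstatus_list reach this
        # callback, so check the status before trying to extract data
        if response.status != 200:
            self.logger.warning('Got %s for %s', response.status, response.url)
            return
        # ... normal extraction logic ...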

In most cases, 400 will be an error. If you want, you can use an errback to handle these errors; see the Scrapy docs for details.
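For the errback route, here is a minimal sketch along the lines of the errback example in the Scrapy docs (the spider name and logging are illustrative):

    import scrapy
    from scrapy.spidermiddlewares.httperror import HttpError

    class ErrbackExampleSpider(scrapy.Spider):
        name = 'errback_example'

        def start_requests(self):
            # Attach an errback so failed requests (including HTTP 400/500
            # responses) are routed to handle_error instead of being dropped
            yield scrapy.Request(
                'https://www.bbc.co.uk/news/topics/c3np65e0jq4t',
                callback=self.parse,
                errback=self.handle_error,
            )

        def parse(self, response):
            self.logger.info('Successfully fetched %s', response.url)

        def handle_error(self, failure):
            # The HttpError middleware raises HttpError for non-2xx responses
            if failure.check(HttpError):
                response = failure.value.response
                self.logger.error('HttpError %s on %s', response.status, response.url)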
