How to check for broken links in Scrapy?


I have a list of links. How can I check whether each link is broken or not? Roughly, I need to implement a construction like this:

def parse(self, response, **cb_kwargs):
    for link in links:
        # if the response is HTTP 404 -> callback=self.parse_data...
        # elif the response is HTTP 200 -> callback=self.parse_product...

def parse_data(self, response, **cb_kwargs):
    pass

def parse_product(self, response, **cb_kwargs):
    pass

The point is that I need to know the status in the first method (parse). Is this possible?



Comments (1)


You could add the links to start_urls, and in parse() you can check response.status (and get response.url). You can process the URL right there; there is no need to send it again with a new Request. Besides, Scrapy (by default) skips duplicate requests.

But Scrapy skips parse() for URLs that return an error status, so you have to add those status codes to handle_httpstatus_list.

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = [
        'http://httpbin.org/get',    # 200
        'http://httpbin.org/error',  # 404
        'http://httpbin.org/post',   # 405
    ]

    # without this, responses with these statuses would never reach parse()
    handle_httpstatus_list = [404, 405]
    
    def parse(self, response):
        print('url:', response.url)
        print('status:', response.status)

        if response.status == 200:
            self.process_200(response)
        
        if response.status == 404:
            self.process_404(response)

        if response.status == 405:
            self.process_405(response)

    def process_200(self, response):
        print('Process 200:', response.url)

    def process_404(self, response):
        print('Process 404:', response.url)

    def process_405(self, response):
        print('Process 405:', response.url)
        
# --- run without a project (uncomment FEEDS to save to `output.csv`) ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
#    'USER_AGENT': 'Mozilla/5.0',
#    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
    
})
c.crawl(MySpider)
c.start()
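
Tying this back to the construction from the question, a minimal sketch could look like the following. It keeps the parse_data / parse_product names from the question; everything else (the spider name, the httpbin start URL, the intermediate check_link callback and the source value passed via cb_kwargs) is an assumption made for the example:

import scrapy

class LinkCheckSpider(scrapy.Spider):
    name = 'linkcheck'

    # a httpbin page that contains a handful of links (assumption for the example)
    start_urls = ['http://httpbin.org/links/5']

    # let the callback see 404 responses instead of having them filtered out
    handle_httpstatus_list = [404]

    def parse(self, response, **cb_kwargs):
        # follow every link on the page; the status check happens per response
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, callback=self.check_link,
                                  cb_kwargs={'source': response.url})

    def check_link(self, response, source):
        # the status is known here, so dispatch to the matching handler
        if response.status == 404:
            yield from self.parse_data(response, source=source)
        elif response.status == 200:
            yield from self.parse_product(response, source=source)

    def parse_data(self, response, **cb_kwargs):
        # broken link
        yield {'url': response.url, 'status': response.status, **cb_kwargs}

    def parse_product(self, response, **cb_kwargs):
        # working link
        yield {'url': response.url, 'status': response.status, **cb_kwargs}

As above, the key detail is handle_httpstatus_list: without it, the 404 responses would never reach check_link.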

EDIT:

I didn't test it, but the documentation section

Using errbacks to catch exceptions in request processing

shows how to use errback= to send the failure to a function when a request results in an error.

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

There is also the section

Accessing additional data in errback functions
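
Based on that section, a minimal sketch of carrying extra data into both the callback and the errback through cb_kwargs could look like this; the spider name, the URLs and the main_url key are placeholders chosen for the example:

import scrapy

class CbKwargsErrbackSpider(scrapy.Spider):
    name = 'cb_kwargs_errback'
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        # pass extra data (the page we came from) along with the new request
        yield scrapy.Request('http://httpbin.org/status/404',
                             callback=self.parse_page,
                             errback=self.errback_page,
                             cb_kwargs={'main_url': response.url})

    def parse_page(self, response, main_url):
        # on success, cb_kwargs arrive as keyword arguments
        self.logger.info('Fetched %s (linked from %s)', response.url, main_url)

    def errback_page(self, failure):
        # on failure, the same data is still available on the original request
        main_url = failure.request.cb_kwargs['main_url']
        self.logger.error('Request linked from %s failed: %r', main_url, failure)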
