How to check for broken links in Scrapy?


I have a list of links. How can I check whether each link is broken or not? Roughly, I need to implement a construction like this:

def parse(self, response, **cb_kwargs):
    for link in links:
        # if the response is HTTP 404 -> callback=self.parse_data...
        # elif the response is HTTP 200 -> callback=self.parse_product...

def parse_data(self, response, **cb_kwargs):
    pass

def parse_product(self, response, **cb_kwargs):
    pass

The point is that I need to know the status in the first method (parse). Is this possible?



Comments (1)


You could add the links to start_urls, and in parse() you can check response.status (and get response.url). You can process the URL right there; there is no need to send it again with a new Request. Besides, Scrapy (by default) skips duplicate requests.

But Scrapy skips parse() for URLs that return an error status, so you have to add those status codes to handle_httpstatus_list.

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = [
        'http://httpbin.org/get',    # 200
        'http://httpbin.org/error',  # 404
        'http://httpbin.org/post',   # 405
    ]

    # without this, responses with these statuses would never reach parse()
    handle_httpstatus_list = [404, 405]
    
    def parse(self, response):
        print('url:', response.url)
        print('status:', response.status)

        if response.status == 200:
            self.process_200(response)
        
        if response.status == 404:
            self.process_404(response)

        if response.status == 405:
            self.process_405(response)

    def process_200(self, response):
        print('Process 200:', response.url)

    def process_404(self, response):
        print('Process 404:', response.url)

    def process_405(self, response):
        print('Process 405:', response.url)
        
# --- run without a project (uncomment FEEDS to save to `output.csv`) ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
#    'USER_AGENT': 'Mozilla/5.0',
#    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
    
})
c.crawl(MySpider)
c.start()
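
Tying this back to the construction from the question, a minimal sketch could look like the following. It keeps the parse_data / parse_product names from the question; everything else (the spider name, the httpbin start URL, the intermediate check_link callback and the source value passed via cb_kwargs) is an assumption made for the example:

import scrapy

class LinkCheckSpider(scrapy.Spider):
    name = 'linkcheck'

    # a httpbin page that contains a handful of links (assumption for the example)
    start_urls = ['http://httpbin.org/links/5']

    # let the callback see 404 responses instead of having them filtered out
    handle_httpstatus_list = [404]

    def parse(self, response, **cb_kwargs):
        # follow every link on the page; the status check happens per response
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, callback=self.check_link,
                                  cb_kwargs={'source': response.url})

    def check_link(self, response, source):
        # the status is known here, so dispatch to the matching handler
        if response.status == 404:
            yield from self.parse_data(response, source=source)
        elif response.status == 200:
            yield from self.parse_product(response, source=source)

    def parse_data(self, response, **cb_kwargs):
        # broken link
        yield {'url': response.url, 'status': response.status, **cb_kwargs}

    def parse_product(self, response, **cb_kwargs):
        # working link
        yield {'url': response.url, 'status': response.status, **cb_kwargs}

As above, the key detail is handle_httpstatus_list: without it, the 404 responses would never reach check_link.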

EDIT:

I didn't test it, but the documentation section

Using errbacks to catch exceptions in request processing

shows how to use errback= to send the failure to a function when a request results in an error.

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

There is also the section

Accessing additional data in errback functions
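
Based on that section, a minimal sketch of carrying extra data into both the callback and the errback through cb_kwargs could look like this; the spider name, the URLs and the main_url key are placeholders chosen for the example:

import scrapy

class CbKwargsErrbackSpider(scrapy.Spider):
    name = 'cb_kwargs_errback'
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        # pass extra data (the page we came from) along with the new request
        yield scrapy.Request('http://httpbin.org/status/404',
                             callback=self.parse_page,
                             errback=self.errback_page,
                             cb_kwargs={'main_url': response.url})

    def parse_page(self, response, main_url):
        # on success, cb_kwargs arrive as keyword arguments
        self.logger.info('Fetched %s (linked from %s)', response.url, main_url)

    def errback_page(self, failure):
        # on failure, the same data is still available on the original request
        main_url = failure.request.cb_kwargs['main_url']
        self.logger.error('Request linked from %s failed: %r', main_url, failure)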
