How can I check for broken links in Scrapy?

I have an array of links, and I need a way to check whether each link is broken or not. Roughly, I need to implement a construction like this:

    def parse(self, response, **cb_kwargs):
        for link in links:
            # if response is HTTP 404: callback=self.parse_data...
            # elif response is HTTP 200: callback=self.parse_product...

    def parse_data(self, response, **cb_kwargs):
        pass

    def parse_product(self, response, **cb_kwargs):
        pass

The point is that I need to know the status in the first method (parse). Is this possible?
You could add the links to start_urls, and in parse() you can check response.status (and get response.url). You can then process that URL directly in code; there is no need to send it again with a Request, especially since Scrapy by default skips duplicate requests. But Scrapy skips parse() for URLs that return an error status, so you have to add those statuses to handle_httpstatus_list.
EDIT:

I didn't test it, but in the documentation you can also see "Using errbacks to catch exceptions in request processing", which shows how to use errback=function to send the response to function when it gets an error. There is also "Accessing additional data in errback functions".