Scrapy fails to download images from URLs such as https://demo?wx_fmt=jpeg
Original link: https://mmbiz.qlogo.cn/mmbiz/...
I am using Scrapy's ImagesPipeline:
import re

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class ImgPipeline(ImagesPipeline):
    """Scrapy image-processing pipeline."""

    # Request each image
    def get_media_requests(self, item, info):
        content = str(item['content'])
        # Capture full http:// or https:// URLs from src attributes
        match = re.findall(r'src="(https?://.*?)"', content)
        item['img_links'] = match
        for img_link in item['img_links']:
            yield scrapy.Request(img_link)

    # Called once all image requests have finished
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no image")
        item['img_paths'] = image_paths
        return item
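The URL extraction step in get_media_requests can be checked on its own, without running a crawl. The following is a minimal sketch; the helper name extract_img_links and the sample HTML string are made up for illustration and are not part of my project.

```python
import re

# Capture absolute http:// or https:// URLs inside src="..." attributes.
IMG_SRC_RE = re.compile(r'src="(https?://.*?)"')


def extract_img_links(content):
    """Return every absolute image URL found in src attributes of an HTML string."""
    return IMG_SRC_RE.findall(str(content))


# Hypothetical sample: one absolute URL with a query string, one relative path.
sample = '<p><img src="https://demo/a?wx_fmt=jpeg"><img src="/relative.png"></p>'
print(extract_img_links(sample))
```

The query string (?wx_fmt=jpeg) is captured as part of the URL, so the extraction itself does not appear to be what breaks the download.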
Exception:
2017-12-22 10:06:47 [scrapy.pipelines.files] ERROR: File (unknown-error): Error processing file from <GET http://mmbiz.qpic.cn/mmbiz/AWbBdRJFaKQ4vb5qV2Nyc41VAuLmiaqePia7hI0uMlE3KRbZEOsaB4jAPdibnzBAmKp1aCiateeXGXoicsAfMugCVog/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1> referred in <None>
Traceback (most recent call last):
File "C:\Users\zjx\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "C:\Users\zjx\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1363, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://mmbiz.qpic.cn/mmbiz/AWbBdRJFaKQ4vb5qV2Nyc41VAuLmiaqePia7hI0uMlE3KRbZEOsaB4jAPdibnzBAmKp1aCiateeXGXoicsAfMugCVog/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\files.py", line 356, in media_downloaded
checksum = self.file_downloaded(response, request, info)
File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\images.py", line 98, in file_downloaded
return self.image_downloaded(response, request, info)
File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\images.py", line 102, in image_downloaded
for path, image, buf in self.get_images(response, request, info):
File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\images.py", line 115, in get_images
orig_image = Image.open(BytesIO(response.body))
File "C:\Users\zjx\Anaconda3\lib\site-packages\PIL\Image.py", line 2519, in open
% (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x000001842C76EFC0>
My current analysis is that this link returns the image as base64 data, which Scrapy cannot recognize.
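One way to test this hypothesis is to look at the first bytes of the response body. The sketch below is only a diagnostic aid under assumptions: the function names sniff_image_format and decode_if_base64 are hypothetical helpers, not Scrapy or Pillow APIs. Note that the URL in the log also carries tp=webp, so a WebP body is another candidate worth ruling out, since a Pillow build without WebP support may fail to identify such a file.

```python
import base64
import binascii


def sniff_image_format(body):
    """Guess the format of a response body from its leading magic bytes."""
    if body.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if body.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if body[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if body[:4] == b"RIFF" and body[8:12] == b"WEBP":
        return "webp"
    return "unknown"


def decode_if_base64(body):
    """Return the decoded bytes if body is valid base64 text, otherwise None."""
    text = body.strip()
    # Strip an optional "data:image/...;base64," prefix.
    if text.startswith(b"data:") and b"," in text:
        text = text.split(b",", 1)[1]
    try:
        return base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        return None
```

Inside an overridden get_images, calling sniff_image_format(response.body) and, if that returns "unknown", decode_if_base64(response.body) would show whether the body is raw image data, WebP, or base64 text.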
Comments (1)
Did you ever solve this? I have run into the same problem (this site only lets you send private messages after 24 hours, TAT).