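Scrapy image crawling occasionally raises an error
When crawling Baidu Baike images with Scrapy, I occasionally get the error below. I have not been able to find a solution and would appreciate any pointers.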
2019-06-10 11:48:31 [scrapy.pipelines.files] ERROR: File (unknown-error): Error processing file from <GET https://gss2.bdstatic.com/-fo3dSag_xI4khGkpoWK1HF6hhy/baike/pic/item/63d9f2d3572c11df9f377a05612762d0f703c236.jpg> referred in <None>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/local/lib/python3.6/dist-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://gss2.bdstatic.com/-fo3dSag_xI4khGkpoWK1HF6hhy/baike/pic/item/63d9f2d3572c11df9f377a05612762d0f703c236.jpg>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/files.py", line 401, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 101, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 105, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 125, in get_images
    image, buf = self.convert_image(orig_image)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 151, in convert_image
    image.save(buf, 'JPEG')
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 1899, in save
    self.load()
  File "/usr/lib/python3/dist-packages/PIL/ImageFile.py", line 228, in load
    "(%d bytes not processed)" % len(b))
OSError: image file is truncated (4 bytes not processed)
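
The final OSError comes from Pillow, which raises it when the decoder runs out of data before the image is finished. One possible reading is that the response body for this URL was cut short even though the status was 200. The following is a minimal sketch, assuming the images are JPEGs, for checking whether a downloaded body at least starts and ends with the JPEG markers. `looks_like_complete_jpeg` is a hypothetical helper, not part of Scrapy or Pillow, and the byte strings are made-up stand-ins for a real response body. It is only a heuristic: a body can pass this check and still be corrupt.

```python
JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_like_complete_jpeg(data: bytes) -> bool:
    """Heuristic check: the body starts with SOI and ends with EOI.

    Hypothetical helper for illustration; it does not decode the image.
    """
    return data.startswith(JPEG_SOI) and data.endswith(JPEG_EOI)


# Made-up bodies standing in for response.body in a pipeline.
complete_body = JPEG_SOI + b"fake image payload" + JPEG_EOI
truncated_body = complete_body[:-4]  # simulate a download that was cut short

print(looks_like_complete_jpeg(complete_body))
print(looks_like_complete_jpeg(truncated_body))
```

A check like this could be used to log or retry suspicious downloads before Pillow touches them. Pillow also has a module-level flag, `ImageFile.LOAD_TRUNCATED_IMAGES`, that makes it tolerate truncated files; whether that is appropriate for this crawl has not been tested here.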
The relevant pipeline code is below:
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class BaidubaikeImagePipeline(ImagesPipeline):

    # Keep the image's original file name.
    def file_path(self, request, response=None, info=None):
        image_guid = request.url.split('/')[-1]
        # Fall back to the default 'full' directory when no save path is set.
        image_save_path = request.meta.get('image_save_path') or 'full'
        return u'{0}/{1}'.format(image_save_path, image_guid)

    def get_media_requests(self, item, info):
        if item is None:
            return
        for image_url in item.get('image_urls') or []:
            # Skip inline data URIs: the image data is base64-encoded into
            # ASCII text in the URL itself, so there is nothing to download.
            if 'data:image/png;' in image_url:
                continue
            yield scrapy.Request(
                image_url,
                meta={'image_save_path': item['image_save_path']},
                dont_filter=True,
            )

    def item_completed(self, results, item, info):
        # `ok` is True when the download succeeded.
        image_path = [x['path'] for ok, x in results if ok]
        if not image_path:
            print('Item contains no images')
            # raise DropItem("Item contains no images")
        return item
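
To make the `file_path` logic easier to check outside of a crawl, here is a small standalone sketch of how the storage path is built from the URL and the save directory. `build_image_path` is a hypothetical helper, and it differs from the pipeline in one respect: it takes the file name from the URL path via `urllib.parse` rather than `split('/')[-1]`, so a query string would not end up in the file name. The `?v=1` suffix and the `baike/python` directory are made-up values for illustration.

```python
from posixpath import basename
from urllib.parse import urlparse


def build_image_path(url, image_save_path=None):
    """Return '<directory>/<file name>' for an image URL (hypothetical helper)."""
    # Use only the path component of the URL, so any query string is dropped.
    file_name = basename(urlparse(url).path)
    # Fall back to 'full' when no save path is given, as in the pipeline.
    return '{0}/{1}'.format(image_save_path or 'full', file_name)


# URL taken from the error log above.
url = ('https://gss2.bdstatic.com/-fo3dSag_xI4khGkpoWK1HF6hhy/baike/pic/item/'
       '63d9f2d3572c11df9f377a05612762d0f703c236.jpg')

print(build_image_path(url))
print(build_image_path(url + '?v=1', 'baike/python'))
```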