Scrapy image crawling occasionally raises an error

Posted on 2022-09-11 20:21:39

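When I crawl Baidu Baike images with Scrapy, I occasionally get the error below. I have not been able to find a solution, so any pointers would be appreciated.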

2019-06-10 11:48:31 [scrapy.pipelines.files] ERROR: File (unknown-error): Error processing file from <GET https://gss2.bdstatic.com/-fo3dSag_xI4khGkpoWK1HF6hhy/baike/pic/item/63d9f2d3572c11df9f377a05612762d0f703c236.jpg> referred in <None>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/local/lib/python3.6/dist-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://gss2.bdstatic.com/-fo3dSag_xI4khGkpoWK1HF6hhy/baike/pic/item/63d9f2d3572c11df9f377a05612762d0f703c236.jpg>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/files.py", line 401, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 101, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 105, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 125, in get_images
    image, buf = self.convert_image(orig_image)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 151, in convert_image
    image.save(buf, 'JPEG')
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 1899, in save
    self.load()
  File "/usr/lib/python3/dist-packages/PIL/ImageFile.py", line 228, in load
    "(%d bytes not processed)" % len(b))
OSError: image file is truncated (4 bytes not processed)
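
The last exception is raised by Pillow while decoding the downloaded body, which suggests the body was incomplete or corrupted even though the response status was 200. A minimal sketch, assuming the file is a JPEG, of a quick check for the start-of-image and end-of-image markers before the bytes reach a decoder. The function name `looks_like_complete_jpeg` is hypothetical and is not part of Scrapy or Pillow.

```python
JPEG_SOI = b'\xff\xd8'  # start-of-image marker at the beginning of a JPEG
JPEG_EOI = b'\xff\xd9'  # end-of-image marker at the end of a JPEG


def looks_like_complete_jpeg(body: bytes) -> bool:
    """Heuristic: True if body starts with SOI and ends with EOI.

    Hypothetical helper for illustration. A file with extra bytes after
    the EOI marker would be reported as incomplete by this check.
    """
    return body.startswith(JPEG_SOI) and body.endswith(JPEG_EOI)


# Tiny stand-in payloads, not real images
complete = JPEG_SOI + b'payload' + JPEG_EOI
cut_off = complete[:-4]  # simulate a download that lost its last bytes

print(looks_like_complete_jpeg(complete))
print(looks_like_complete_jpeg(cut_off))
```

This is only a heuristic: it can flag a body that was cut short, but it does not prove that an image passing the check will decode cleanly.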

The relevant pipeline code is below:

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class BaidubaikeImagePipeline(ImagesPipeline):

    # Keep the image's original file name
    def file_path(self, request, response=None, info=None):
        image_guid = request.url.split('/')[-1]
        # Fall back to the default 'full' directory when no save path is set
        image_save_path = request.meta.get('image_save_path') or 'full'
        return '{0}/{1}'.format(image_save_path, image_guid)

    def get_media_requests(self, item, info):
        if item is None:
            return
        for image_url in item.get('image_urls') or []:
            # Skip inline base64 data URIs (image data encoded as ASCII text)
            if 'data:image/png;' in image_url:
                continue
            yield scrapy.Request(
                image_url,
                meta={'image_save_path': item.get('image_save_path')},
                dont_filter=True,
            )

    def item_completed(self, results, item, info):
        # ok is True when the image was downloaded and saved successfully
        image_path = [x['path'] for ok, x in results if ok]
        if not image_path:
            print('Item contains no images')
            # raise DropItem("Item contains no images")
        return item
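
For reference, `item_completed` receives `results` as a list of `(success, info_or_failure)` tuples, so an image that fails during processing (as in the traceback above) should appear with `success` set to `False` rather than being silently dropped. A rough sketch of separating the two cases, using plain Python stand-ins: the helper `split_results`, the sample URL, and the use of a bare `OSError` in place of a Twisted `Failure` are all assumptions made for illustration.

```python
def split_results(results):
    """Separate saved paths from failed entries in a results list."""
    paths = [info['path'] for ok, info in results if ok]
    failures = [err for ok, err in results if not ok]
    return paths, failures


# Hypothetical sample data shaped like the (success, info_or_failure) tuples
sample_results = [
    (True, {'url': 'https://example.com/a.jpg', 'path': 'full/a.jpg', 'checksum': 'abc'}),
    (False, OSError('image file is truncated (4 bytes not processed)')),
]

paths, failures = split_results(sample_results)
print(paths)
print(len(failures))
```

Collecting the failed entries this way makes it possible to log or retry the affected URLs instead of only detecting the case where every image failed.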
