Scrapy image crawl occasionally fails with an error

Posted 2022-09-11 20:21:39 | 3347 characters | 19 views | 0 comments

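When crawling Baidu Baike images with Scrapy, I occasionally get the error below. I have not been able to find a solution and would appreciate any pointers.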

2019-06-10 11:48:31 [scrapy.pipelines.files] ERROR: File (unknown-error): Error processing file from <GET https://gss2.bdstatic.com/-fo3dSag_xI4khGkpoWK1HF6hhy/baike/pic/item/63d9f2d3572c11df9f377a05612762d0f703c236.jpg> referred in <None>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/local/lib/python3.6/dist-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://gss2.bdstatic.com/-fo3dSag_xI4khGkpoWK1HF6hhy/baike/pic/item/63d9f2d3572c11df9f377a05612762d0f703c236.jpg>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/files.py", line 401, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 101, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 105, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 125, in get_images
    image, buf = self.convert_image(orig_image)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 151, in convert_image
    image.save(buf, 'JPEG')
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 1899, in save
    self.load()
  File "/usr/lib/python3/dist-packages/PIL/ImageFile.py", line 228, in load
    "(%d bytes not processed)" % len(b))
OSError: image file is truncated (4 bytes not processed)
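
The final `OSError` is raised by Pillow's `ImageFile.load` when the image data ends before the decoder has finished, which suggests an incomplete download rather than a bug in the pipeline code. One commonly used workaround, sketched here, is Pillow's module-level `ImageFile.LOAD_TRUNCATED_IMAGES` flag, which makes the loader accept whatever data is present. The helper names `make_truncated_jpeg` and `can_decode` and the synthetic test image are assumptions for illustration and are not part of the original post. Note that the flag hides the problem rather than fixing the download: the missing part of the image stays blank.

```python
import io

from PIL import Image, ImageFile


def make_truncated_jpeg(keep_fraction=0.5):
    """Build an in-memory JPEG and cut off its tail to imitate a partial download."""
    # A busy pattern so the compressed data is large enough to cut into.
    pixels = bytes((x * y) % 256 for y in range(256) for x in range(256))
    image = Image.frombytes("L", (256, 256), pixels)
    buf = io.BytesIO()
    image.save(buf, "JPEG")
    data = buf.getvalue()
    return data[: int(len(data) * keep_fraction)]


def can_decode(data):
    """Return True if Pillow can fully decode the image bytes."""
    try:
        Image.open(io.BytesIO(data)).load()
    except OSError:
        return False
    return True


truncated = make_truncated_jpeg()

# Pillow's strict default: decoding an incomplete file raises OSError.
print(can_decode(truncated))

# Tolerant mode: Pillow keeps the part of the image it managed to decode.
ImageFile.LOAD_TRUNCATED_IMAGES = True
print(can_decode(truncated))
```

In a Scrapy project the flag would typically be set once at import time, for example at the top of the module that defines the image pipeline, since `ImagesPipeline` decodes images through Pillow.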

The relevant pipeline code is as follows:

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class BaidubaikeImagePipeline(ImagesPipeline):

    # Keep the image's original file name.
    def file_path(self, request, response=None, info=None):
        image_guid = request.url.split('/')[-1]
        # Fall back to the 'full' directory when no save path is set.
        image_save_path = request.meta['image_save_path'] or 'full'
        return u'{0}/{1}'.format(image_save_path, image_guid)

    def get_media_requests(self, item, info):
        if item is None:
            return
        for image_url in item.get('image_urls') or []:
            # Skip inline PNG images embedded as base64 data URIs.
            if 'data:image/png;' in image_url:
                continue
            yield scrapy.Request(image_url, meta={'image_save_path': item['image_save_path']}, dont_filter=True)

    def item_completed(self, results, item, info):
        # `ok` is True for each image that was downloaded successfully.
        image_path = [x['path'] for ok, x in results if ok]
        if not image_path:
            print('Item contains no images')
            # raise DropItem("Item contains no images")
        return item
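
Since the error appears only occasionally, it may help to check whether a response body looks complete before it reaches Pillow, and to skip or re-request it otherwise. The following is a rough heuristic, not something the original pipeline does: `looks_like_complete_jpeg` is a hypothetical helper that checks the JPEG start and end markers and, when known, the declared `Content-Length`. Some valid files carry extra bytes after the end marker, and `Content-Length` does not match the body size for compressed responses, so a failed check is a hint rather than proof.

```python
JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_like_complete_jpeg(body, content_length=None):
    """Heuristically decide whether `body` holds a whole JPEG file.

    body: raw response bytes.
    content_length: value of the Content-Length header as an int, if known.
    """
    if content_length is not None and len(body) != content_length:
        return False
    return body.startswith(JPEG_SOI) and body.endswith(JPEG_EOI)


# Tiny stand-in for a real download: markers around placeholder payload bytes.
complete = JPEG_SOI + b"payload" + JPEG_EOI

print(looks_like_complete_jpeg(complete))
print(looks_like_complete_jpeg(complete[:-4]))
print(looks_like_complete_jpeg(complete, content_length=len(complete) + 4))
```

In the pipeline this could be called on `response.body` from an overridden `get_images`, the method visible in the traceback, before delegating to the parent implementation.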
