Tags: web crawler, Scrapy, image scraping

Scrapy cannot download images from URLs of the form https://demo?wx_fmt=jpeg

Posted 2022-09-06 19:29:46 · 2989 characters · 18 views · 0 comments

Original link: https://mmbiz.qlogo.cn/mmbiz/...
I am using Scrapy's ImagesPipeline:

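import re

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class ImgPipeline(ImagesPipeline):
    """
    Scrapy image-processing pipeline
    """

    # Request the images
    def get_media_requests(self, item, info):
        content = str(item['content'])
        # Capture the full URL of every src attribute (http or https)
        match = re.findall(r'src="(https?://.*?)"', content)
        item['img_links'] = match
        for img_link in item['img_links']:
            yield scrapy.Request(img_link)

    # After the requests have completed
    def item_completed(self, results, item, info):
        # Keep only the paths of the images that downloaded successfully
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no image")
        item['img_paths'] = image_paths
        return item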
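
One detail worth checking: the request URL in the log below contains literal `&amp;` sequences. That suggests the regex is picking up HTML-escaped attribute values, so the server may be receiving different query parameters than the browser would send. The following is a minimal sketch of decoding those entities before building requests. The helper name `extract_image_urls` and the sample URL are hypothetical, and this is not confirmed to be the cause of the error.

```python
import html
import re

# Hypothetical helper (not from the original post): extract image URLs from
# raw HTML and decode entities such as &amp; before they are requested.
IMG_SRC_RE = re.compile(r'src="(https?://[^"]+)"')


def extract_image_urls(content):
    """Return the src URLs found in content, with HTML entities decoded."""
    return [html.unescape(url) for url in IMG_SRC_RE.findall(str(content))]


# Made-up sample markup that mimics the query string seen in the log.
sample = '<img src="http://example.com/640?wx_fmt=png&amp;tp=webp&amp;wx_lazy=1">'
print(extract_image_urls(sample))
```

In the pipeline above, `get_media_requests` could assign the result of such a helper to `item['img_links']` in place of the bare `re.findall` call.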

Exception:

2017-12-22 10:06:47 [scrapy.pipelines.files] ERROR: File (unknown-error): Error processing file from <GET http://mmbiz.qpic.cn/mmbiz/AWbBdRJFaKQ4vb5qV2Nyc41VAuLmiaqePia7hI0uMlE3KRbZEOsaB4jAPdibnzBAmKp1aCiateeXGXoicsAfMugCVog/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1> referred in <None>
Traceback (most recent call last):
  File "C:\Users\zjx\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "C:\Users\zjx\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1363, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://mmbiz.qpic.cn/mmbiz/AWbBdRJFaKQ4vb5qV2Nyc41VAuLmiaqePia7hI0uMlE3KRbZEOsaB4jAPdibnzBAmKp1aCiateeXGXoicsAfMugCVog/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\files.py", line 356, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\images.py", line 98, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\images.py", line 102, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\images.py", line 115, in get_images
    orig_image = Image.open(BytesIO(response.body))
  File "C:\Users\zjx\Anaconda3\lib\site-packages\PIL\Image.py", line 2519, in open
    % (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x000001842C76EFC0>

My current analysis is that this link returns the image as base64 data, which Scrapy cannot recognize.
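
One way to test this hypothesis is to look at the first bytes of the response body. The sketch below checks the well-known file signatures of PNG, JPEG, GIF and WebP, and then retries after attempting a base64 decode. The function names `sniff_image_format` and `describe_payload` are hypothetical, and the check is a rough diagnostic under those assumptions, not a full format detector.

```python
import base64
import binascii


def sniff_image_format(body):
    """Guess an image format from its leading magic bytes, or return None."""
    if body.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if body.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if body[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if body[:4] == b"RIFF" and body[8:12] == b"WEBP":
        return "webp"
    return None


def describe_payload(body):
    """Report whether body is raw image bytes, base64 of an image, or unknown."""
    fmt = sniff_image_format(body)
    if fmt:
        return fmt
    try:
        # validate=True rejects input containing non-base64 characters
        decoded = base64.b64decode(body.strip(), validate=True)
    except (binascii.Error, ValueError):
        return "unknown"
    fmt = sniff_image_format(decoded)
    return "base64:" + fmt if fmt else "unknown"


# Synthetic payloads for illustration only.
png_header = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
print(describe_payload(png_header))
print(describe_payload(base64.b64encode(png_header)))
```

Logging `describe_payload(response.body)` from an overridden `get_images(response, request, info)`, the method that appears in the traceback, would show whether the body is base64 text, a format the installed Pillow cannot open, or something else such as an HTML error page.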

Comments (1)

魔法少女 2022-09-13 19:29:46

Have you solved this problem? I ran into the same issue (this site only lets you send private messages after 24 hours, TAT)
