Scrapy fails to download images from URLs like https://demo?wx_fmt=jpeg

Posted on 2022-09-06 19:29:46 · 2989 characters · 18 views · 0 comments

Original link: https://mmbiz.qlogo.cn/mmbiz/...
I am using Scrapy's ImagesPipeline.

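import re

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class ImgPipeline(ImagesPipeline):
    """
    Scrapy image-processing pipeline.
    """

    # Request each image
    def get_media_requests(self, item, info):
        content = str(item['content'])
        # Capture the full URL of every src="..." attribute starting with http:// or https://
        match = re.findall(r'src="(https?://[^"]+)"', content)
        item['img_links'] = match
        for img_link in item['img_links']:
            yield scrapy.Request(img_link)

    # Called once all image requests for the item have finished
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no image")
        item['img_paths'] = image_paths
        return item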

Exception

2017-12-22 10:06:47 [scrapy.pipelines.files] ERROR: File (unknown-error): Error processing file from <GET http://mmbiz.qpic.cn/mmbiz/AWbBdRJFaKQ4vb5qV2Nyc41VAuLmiaqePia7hI0uMlE3KRbZEOsaB4jAPdibnzBAmKp1aCiateeXGXoicsAfMugCVog/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1> referred in <None>
Traceback (most recent call last):
  File "C:\Users\zjx\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "C:\Users\zjx\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1363, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://mmbiz.qpic.cn/mmbiz/AWbBdRJFaKQ4vb5qV2Nyc41VAuLmiaqePia7hI0uMlE3KRbZEOsaB4jAPdibnzBAmKp1aCiateeXGXoicsAfMugCVog/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\files.py", line 356, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\images.py", line 98, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\images.py", line 102, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\images.py", line 115, in get_images
    orig_image = Image.open(BytesIO(response.body))
  File "C:\Users\zjx\Anaconda3\lib\site-packages\PIL\Image.py", line 2519, in open
    % (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x000001842C76EFC0>
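
One detail in the log above: the requested URL contains the literal text `&amp;`, which suggests the `src` value was taken from the HTML without decoding entities. Whether that is what triggers the OSError is not confirmed here. Below is a minimal sketch, assuming the links are pulled from raw HTML attributes, of decoding them with the standard library before building requests. The helper name `extract_img_links` and the sample HTML are made up for illustration.

```python
import html
import re
from typing import List
from urllib.parse import parse_qs, urlsplit

# Matches the value of src="..." when it is an absolute http(s) URL.
IMG_SRC_RE = re.compile(r'src="(https?://[^"]+)"')


def extract_img_links(content: str) -> List[str]:
    """Return image URLs from raw HTML, with entities such as &amp; decoded."""
    return [html.unescape(url) for url in IMG_SRC_RE.findall(content)]


if __name__ == "__main__":
    # Hypothetical sample markup; example.com stands in for the real host.
    sample = '<img src="http://example.com/a/640?wx_fmt=png&amp;tp=webp">'
    link = extract_img_links(sample)[0]
    print(link)
    # Inspect the query parameters the server would actually receive.
    print(parse_qs(urlsplit(link).query))
```

In the pipeline, the same decoding could be applied to each link before it is passed to `scrapy.Request`.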

My current analysis is that this link returns the image as base64, which Scrapy cannot recognize.
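
One way to test this hypothesis is to look at the first bytes of the response body before Pillow tries to open it. The following is a rough sketch, not part of Scrapy: `sniff_body` is a hypothetical helper that checks a few well-known file signatures (PNG, JPEG, GIF, WebP) and otherwise tests whether the body looks like a `data:` URI or base64 text wrapping one of those formats.

```python
import base64
import binascii
from typing import Optional

# Well-known leading byte signatures of common image formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_body(body: bytes) -> Optional[str]:
    """Guess what a response body contains from its leading bytes."""
    for magic, name in SIGNATURES.items():
        if body.startswith(magic):
            return name
    # WebP is a RIFF container whose form type at offset 8 is "WEBP".
    if body[:4] == b"RIFF" and body[8:12] == b"WEBP":
        return "webp"
    if body.startswith(b"data:image/"):
        return "data-uri"
    try:
        # validate=True rejects bytes outside the base64 alphabet.
        decoded = base64.b64decode(body.strip(), validate=True)
    except (binascii.Error, ValueError):
        return None
    # Report base64 only if the decoded payload is itself recognizable.
    inner = sniff_body(decoded) if decoded else None
    return f"base64:{inner}" if inner else None
```

Assuming the pipeline from the question, this could be called on `response.body` inside an overridden `get_images` and the result logged before delegating to the parent class. If it reports `webp`, the cause may be a Pillow build without WebP support rather than base64; if it reports `None`, the body is probably not image data at all.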


Comments (1)

魔法少女 2022-09-13 19:29:46

Did you ever solve this? I ran into the same problem (this site only lets you send private messages after 24 hours, TAT)
