pdf2image: converting a multi-page PDF to images returns the last page for every image

Posted 2025-01-12 23:43:44


When I import pdf2image in Python and pass a multi-page PDF to convert_from_bytes() or convert_from_path(), the output list does contain multiple images, but every image shows the last PDF page (whereas I expected each image to represent one of the PDF pages).

The output looks something like this:

Any idea on why this would occur? I can't find any solution to this online. I've found some vague suggestion that the use_cropbox argument might be used, but modifying it has no effect.

# Method of a class; self.pdf2image is assumed to reference the imported pdf2image module
def convert(self, opened_file):
    # Read PDF and convert pages to PPM image objects
    try:
        _ppm_pages = self.pdf2image.convert_from_bytes(
            opened_file.read(),
            grayscale=True,
        )
    except Exception as e:
        print(f"[CreateJPEG] Could not convert PDF pages to JPEG image due to error: \n    '{e}'")
        return

    # Do stuff with _ppm_pages
    for img in _ppm_pages:
        img.show()  # ...all images in that list are of the last page

Sometimes the output is an empty 1x1 image instead, and I haven't found a reason for that either. If you have any idea what that is about, please let me know!

Thanks in advance,
Simon

EDIT: Added code.

EDIT: So, when I try this in a random notebook, it actually works fine.

I've removed a few detours I used in my original code, and now it works. Still not sure what the underlying reason was though...

All the same, thanks for your help, everyone!
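
For anyone hitting the same symptom, a quick check is to hash each returned image and count the distinct results. This is only a minimal sketch, not a diagnosis of the root cause: `page_digests`, `count_distinct_pages` and `check_pdf` are hypothetical helper names, and the code assumes nothing beyond the returned objects exposing `tobytes()`, as PIL images do.

```python
import hashlib


def page_digests(pages):
    """Return one SHA-256 hex digest per page, computed from its raw pixel bytes.

    `pages` can be any iterable of objects with a `tobytes()` method,
    which PIL images provide.
    """
    return [hashlib.sha256(page.tobytes()).hexdigest() for page in pages]


def count_distinct_pages(pages):
    """Count how many different pixel contents appear among the pages."""
    return len(set(page_digests(pages)))


def check_pdf(pdf_path):
    """Convert a PDF and return (number of images, number of distinct images)."""
    # Imported lazily so the helpers above work without pdf2image/poppler installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, grayscale=True)
    return len(pages), count_distinct_pages(pages)
```

If `check_pdf` reports fewer distinct images than pages for a PDF whose pages clearly differ, the list itself holds duplicates; if the two numbers match, the conversion is fine and the duplication is more likely introduced by the surrounding code.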


泅渡 2025-01-19 23:43:44

I'm using this right now:

from pdf2image import convert_from_path

# pathToPDF is the path to the input file; the second positional argument is the DPI
imgSet = convert_from_path(pathToPDF, 500)

That gives me a list of images in imgSet.
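
If the images are then written to disk, giving each page its own filename rules out one simple way of ending up with only the last page, namely every iteration overwriting the same file. The sketch below is an illustration under that assumption, not something confirmed as the cause in this thread; `save_pages`, `out_dir` and `stem` are hypothetical names.

```python
from pathlib import Path


def save_pages(pages, out_dir, stem="page"):
    """Save each page under its own numbered PNG filename and return the paths.

    `pages` is an iterable of objects with a `save(path)` method,
    such as the PIL images returned by pdf2image.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, page in enumerate(pages, start=1):
        # The page number is part of the name, so no file is overwritten.
        target = out_dir / f"{stem}_{number:03d}.png"
        page.save(target)
        paths.append(target)
    return paths
```

With the list from above this would be called as `save_pages(imgSet, "out")`, producing `out/page_001.png`, `out/page_002.png`, and so on.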


从此见与不见 2025-01-19 23:43:44

I guess you have to do something like this, as shown in the package's unit tests:

from pdf2image import convert_from_bytes

# Open the PDF in binary mode and pass its raw bytes to pdf2image
with open("./tests/test.pdf", "rb") as pdf_file:
    images_from_bytes = convert_from_bytes(pdf_file.read(), fmt="jpg")
    assert images_from_bytes[0].format == "JPEG"
