我想将 PDF 转换为图像,以便可以对其进行 OCR。但在转换过程中质量会下降。
使用Python将PDF转换为图像(JPG/PNG)似乎有两种主要方法 - pdf2image 和 ImageMagick/魔杖。
#pdf2image (altering dpi to 300/600 etc does not seem to make a difference):
pages = convert_from_path("page.pdf", dpi=300)
for page in pages:
page.save("page.jpg", 'JPEG')
#ImageMagick (Wand lib)
with Image(filename="page.pdf", resolution=300) as img:
img.compression_quality = 100
img.save(filename="page.jpg")
但如果我只是在 Mac 上截取 PDF 的屏幕截图,质量会比使用任何一种 Python 转换方法都要高。
看到这一点的一个好方法是在生成的图像上运行 Tesseract OCR - 两种 Python 方法都给出平均结果,而屏幕截图给出完美结果。 (我已经尝试过 PNG 和 JPG。)
假设我有无限的时间、计算能力和存储空间。我只对图像质量和 OCR 输出感兴趣。完美的图像触手可及,但却无法在代码中生成,这是令人沮丧的。
这是怎么回事?有没有更好的方法来转换 PDF?有没有办法可以更直接地控制?是什么让屏幕截图比实际转换效果更好?
I am tying to convert a PDF to an image so I can OCR it. But the quality is being degraded during the conversion.
There seem to be two main methods for converting a PDF to an image (JPG/PNG) with Python - pdf2image and ImageMagick/Wand.
#pdf2image (altering dpi to 300/600 etc does not seem to make a difference):
pages = convert_from_path("page.pdf", dpi=300)
for page in pages:
page.save("page.jpg", 'JPEG')
#ImageMagick (Wand lib)
with Image(filename="page.pdf", resolution=300) as img:
img.compression_quality = 100
img.save(filename="page.jpg")
But if I simply take a screenshot of the PDF on a Mac, the quality is higher than using either Python conversion method.
A good way to see this is to run Tesseract OCR on the resulting images - both Python methods give average results, whereas the screenshot gives perfect results. (I've tried both PNG and JPG.)
Assume I have infinite time, computing power and storage space. I am only interested in image quality and OCR output. It's frustrating to have the perfect image just within reach, but not be able to generate it in code.
What is going on here? Is there a better way to convert a PDF? Is there a way I can get more direct control? What is a screenshot doing such a better job than an actual conversion?
发布评论
评论(1)
您可以使用
PyMuPDF
并设置您想要的 dpi:You can use
PyMuPDF
and set the dpi you want: