High-resolution images from PDFs
I'm working on a project in which I need to extract a TIFF per page from multi-page PDFs. The PDFs contain images only, and there is one image per page (I believe they were made on some kind of photocopier/scanner, but haven't confirmed this). The TIFFs are then used to create several other derivative versions of the document, so the higher the resolution the better.
I've found two recipes, both with helpful aspects, but neither is ideal. Hoping someone can help me tune one of them, or offer a third option.
Recipe 1, pdfimages and ImageMagick:
First do:
$ pdfimages $MY_PDF.pdf foo
which results in several .pbm files (named foo-000.pbm, foo-001.pbm, etc.).
Then for each *.pbm do:
$ convert $each -resize 3200x3200\> -quality 100 $new_name.tif
Pro: The resultant TIFFs are a healthy 3300+ pixels on the long dimension (-resize just serves to normalize everything).
Con: The orientation of the pages is lost, and they come out rotated in different directions (they follow logical patterns, so they are probably in the orientation in which they were fed to the scanner).
Recipe 2, ImageMagick solo:
$ convert +adjoin $MY_PDF.pdf pages.tif
This gives me a TIFF per page (pages-0.tif, pages-1.tif, etc.).
Pro: Orientation stays!
Con: The long dimension of the resultant file is < 800 px, which is too small to be useful, and it looks as though some compression has been applied.
How can I ditch the scaling of the image stream in the PDF, but retain the orientation? Is there some more magick in ImageMagick that I'm missing? Something else entirely?
Comments (2)
Sorry for the noise on this old topic, but Google brought me here as one of the top results, and it might bring others, so I thought I'd post the solution to the OP's question that I found here: http://robfelty.com/2008/03/11/convert-pdf-to-png-with-imagemagick
In short: you have to tell ImageMagick at which density it should render the PDF.
So
convert -density 600x600 foo.pdf foo.png
will tell ImageMagick to treat the PDF as if it had a 600 dpi resolution and thus output much larger PNGs. In my case, the resulting foo.png was 5000x6600 px. You can optionally add -resize 3000x3000 or whatever size you require, and it will be scaled down.
Note that as long as you only have vector images or text in your PDF files, the density can be set as high as needed. If the PDF contains rasterized images, it won't look good if you set it higher than those images' dpi, surprise! :)
Chris
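For scanned PDFs like the ones in the question, the advice above suggests choosing a density close to the scan's native resolution rather than an arbitrarily high one. The following is only a rough sketch of that arithmetic: `native_dpi` and `render_pages` are hypothetical helper names, and the 3300 px / 792 pt figures are assumed example values (a 3300 px scan on an 11 inch page), not measurements from anyone's files. The page size in points could be read from something like `pdfinfo`.

```shell
#!/usr/bin/env bash
# Sketch: pick a -density that matches the scan's native resolution.

# native_dpi PIXELS POINTS
# Pixels along one side divided by that side's length in inches
# (1 inch = 72 PDF points). Integer arithmetic, so the result is truncated.
native_dpi() {
    echo $(( $1 * 72 / $2 ))
}

# render_pages PDF DPI PREFIX
# Rasterize each page at DPI and write PREFIX-0.tif, PREFIX-1.tif, ...
render_pages() {
    convert -density "$2" "$1" +adjoin "$3-%d.tif"
}

# Assumed example values: a 3300 px long side on an 11 in (792 pt) page.
if [ "$#" -eq 1 ] && [ -f "$1" ]; then
    render_pages "$1" "$(native_dpi 3300 792)" pages
fi
```

Saved as a script and run with a PDF path as its only argument, this would render the pages at the computed density; with other scans the two numbers passed to `native_dpi` would need to be replaced.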
I wanted to share my solution... it may not work for everyone, but since nothing else has come along, maybe it will help someone else. I wound up going with the first option in my question, which was to use pdfimages to get large images that were rotated every which way. I then found a way to use OCR and word counts to guess at the orientation, which got me from (estimated) 25% rotated accurately to above 90%.
The flow is as follows:
pdfimages (apt-get install poppler-utils) to get a set of pbm files (not shown below).
YMMV. My files are bitonal and highly textual. The source images average 3300 px on the long side. I can't speak to greyscale or color, or to files with a lot of images. Most of my source PDFs are bad scans of old photocopies, so the accuracy might be even better with cleaner files. Using -despeckle during the rotation made no difference and slowed things down considerably (~5×). I chose ocrad for speed rather than accuracy, since I only need rough numbers and am throwing away the OCR output. Re: performance, my nothing-special Linux desktop machine can run the whole script over about 2-3 files per second.
Here's the implementation in a simple bash script:
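(The script itself is not included in this copy of the thread. As a stand-in, here is a minimal sketch of the approach described above; it is not the author's original. It assumes that each image is tried at the four right-angle rotations, that ocrad reads a PNM image from standard input, and that the rotation yielding the most OCR'd words is the upright one. The function names are made up for illustration.)

```shell
#!/usr/bin/env bash
# Sketch only: guess page orientation by OCR word count.
# Assumes ImageMagick (convert) and ocrad are installed.

# Read "angle count" lines on stdin and print the angle with the
# highest count. The stable sort keeps the earliest angle on ties.
best_rotation() {
    sort -s -k2,2nr | head -n 1 | cut -d' ' -f1
}

# count_words IMAGE ANGLE
# Count the words ocrad finds in IMAGE after rotating it by ANGLE degrees.
count_words() {
    convert "$1" -rotate "$2" pnm:- | ocrad 2>/dev/null | wc -w | tr -d ' '
}

# fix_orientation IMAGE OUT
# Try all four right-angle rotations and write the best-scoring one to OUT.
fix_orientation() {
    local image=$1 out=$2 angle best
    best=$(for angle in 0 90 180 270; do
        echo "$angle $(count_words "$image" "$angle")"
    done | best_rotation)
    convert "$image" -rotate "$best" "$out"
}

# Process every file named on the command line, e.g. foo-*.pbm.
for f in "$@"; do
    fix_orientation "$f" "${f%.pbm}.tif"
done
```

Something like `./orient.sh foo-*.pbm` (a hypothetical script name) would then write one .tif next to each .pbm. Running OCR four times per page is the main cost, which fits with the choice of a fast OCR engine over an accurate one.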