High-resolution images from PDFs

Posted on 2024-12-27 00:50:13


I'm working on a project in which I need to extract a TIFF per page from multi-page PDFs. The PDFs contain images only and there is one image per page (I believe they were made on some kind of photocopier/scanner, but haven't confirmed this). The TIFFs are then used to create several other derivative versions of the document so the higher the resolution the better.

I've found two recipes, both with helpful aspects, but neither is ideal. Hoping someone can help me tune one of them, or offer a third option.

Recipe 1, pdfimages and ImageMagick:

First do:

$ pdfimages $MY_PDF.pdf foo

Which results in several .pbm files (named foo-000.pbm, foo-001.pbm, etc.).

Then for each *.pbm do:

$ convert $each -resize 3200x3200\> -quality 100 $new_name.tif
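
The per-file step can be wrapped in a loop. A minimal sketch, assuming the foo prefix from the pdfimages call above; the function names tif_name and convert_all are made up for illustration, and the output name is simply the input name with .pbm swapped for .tif:

```shell
#!/bin/bash
# Hypothetical helper: derive the output name by swapping .pbm for .tif.
tif_name() {
    printf '%s\n' "${1%.pbm}.tif"
}

# Convert every foo-*.pbm in the current directory to a TIFF.
# The quoted '3200x3200>' only shrinks images larger than 3200 px on a side;
# smaller images are left at their original size.
convert_all() {
    local each
    for each in foo-*.pbm; do
        [ -e "$each" ] || continue   # nothing matched: skip the literal pattern
        convert "$each" -resize '3200x3200>' -quality 100 "$(tif_name "$each")"
    done
}

convert_all
```

Quoting the geometry as '3200x3200>' does the same job as the backslash in the one-liner: it keeps the shell from treating > as a redirection.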

Pro: The resultant TIFFs are a healthy 3300+ pixels on the long dimension (-resize just serves to normalize everything).

Con: The orientation of the pages is lost, and they come out rotated in different directions (they follow logical patterns, so they are probably in the orientation in which they were fed to the scanner?).

Recipe 2, ImageMagick solo:

$ convert +adjoin $MY_PDF.pdf pages.tif

This gives me a TIFF per page (pages-0.tif, pages-1.tif, etc.).

Pro: Orientation stays!

Con: The long dimension of the resultant file is < 800 px, which is too small to be useful, and it looks as though there is some compression applied.

How can I ditch the scaling of the image stream in the PDF, but retain the orientation? Is there some more magick in ImageMagick that I'm missing? Something else entirely?


谈场末日恋爱 2025-01-03 00:50:13

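Sorry for the noise on this old topic, but Google brought me here as one of the top results and will probably bring others too, so I thought I should post the solution I found to the original poster's problem: http://robfelty.com/2008/03/11/convert-pdf-to-png-with-imagemagick

In short: you have to tell ImageMagick at what density it should scan the PDF.

So convert -density 600x600 foo.pdf foo.png tells ImageMagick to treat the PDF as having a resolution of 600 dpi, and it outputs a much larger PNG. In my case, the resulting foo.png was 5000x6600 px. You can optionally add -resize 3000x3000, or whatever size you need, and it will be scaled down.

Note that as long as the PDF contains only vector images or text, the density can be set as high as you need. If the PDF contains rasterized images, it won't look very good if you set the density higher than the dpi of those images, surprise! :)

Chris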
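
To pick a density rather than guess one, the arithmetic is simple: PDF page dimensions are given in points (72 per inch), so the rendered size is roughly points × density / 72. Below is a sketch with made-up function names. The 612 × 792 pt figures in the comments are US Letter and serve only as an example, and render_pages merely shows how the result would feed the convert -density call described above (it is defined but not run here).

```shell
#!/bin/bash
# Approximate pixels along one page side when rendering at a given density.
# usage: pixels_at_density POINTS DPI
pixels_at_density() {
    echo $(( ($1 * $2 + 36) / 72 ))   # +36 rounds to the nearest pixel
}

# Smallest whole-number density giving at least TARGET_PX pixels
# along a side that is POINTS points long.
# usage: density_for_pixels POINTS TARGET_PX
density_for_pixels() {
    echo $(( ($2 * 72 + $1 - 1) / $1 ))   # integer ceiling of TARGET_PX*72/POINTS
}

# Illustration only: render each page of a PDF to its own TIFF.
# usage: render_pages FILE.pdf LONG_SIDE_POINTS TARGET_PX
render_pages() {
    local dpi
    dpi=$(density_for_pixels "$2" "$3")
    # -density must come before the input file so it applies while the PDF is read.
    convert -density "$dpi" "$1" +adjoin pages.tif
}

# US Letter long side (792 pt) rendered at 600 dpi:
pixels_at_density 792 600
# Density needed for 3300 px on that same side:
density_for_pixels 792 3300
```

For a scanned PDF, a density above the scan's native resolution only interpolates, so the value computed this way is best treated as an upper bound.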


最丧也最甜 2025-01-03 00:50:13


I wanted to share my solution...it may not work for everyone, but since nothing else has come around maybe it will help someone else. I wound up going with the first option in my question, which was to use pdfimages to get large images that were rotated every which way. I then found a way to use OCR and word counts to guess at the orientation, which got me from (estimated) 25% rotated accurately to above 90%.

The flow is as follows:

Use pdfimages (apt-get install poppler-utils) to get a set of pbm files (not shown below).
For each file:
1. Make four versions, rotated 0, 90, 180, and 270 degrees (I refer to them as "north", "east", "south", and "west" in my code).
2. OCR each. The two with the lowest word count are likely the right-side-up and upside-down versions. This was over 99% accurate in my set of images processed to date.
3. From the two with the lowest word count, run the OCR output through a spell check. The file with the fewest spelling errors (i.e. the most recognizable words) is likely to be correct. For my set this was about 93% accurate (up from 25%), based on a sample of 500.
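
The selection rule in steps 2 and 3 boils down to two sorts. Here is a condensed sketch, separate from the full script: it assumes the word count and misspelling count for all four orientations have already been collected into one table (the real script only spell-checks the final two), and the function name and the sample numbers are invented for illustration.

```shell
#!/bin/bash
# stdin: one line per orientation, "WORD_COUNT MISSPELLED_COUNT NAME".
# Keep the two lowest word counts, then pick the one with fewer misspellings.
pick_orientation() {
    sort -n -k1,1 | head -n 2 | sort -n -k2,2 | head -n 1 | awk '{ print $3 }'
}

# Invented sample data: sideways pages produce far more junk "words".
printf '%s\n' \
    '120 40 north' \
    '310 200 east' \
    '125 90 south' \
    '305 210 west' | pick_orientation
```

With this sample, north and south survive the word-count cut, and north wins on misspellings.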

YMMV. My files are bitonal and highly textual. The source images average 3300 px on the long side. I can't speak to greyscale or color, or to files with a lot of images. Most of my source PDFs are bad scans of old photocopies, so the accuracy might be even better with cleaner files. Using -despeckle during the rotation made no difference and slowed things down considerably (~5×). I chose ocrad for speed rather than accuracy, since I only need rough numbers and am throwing away the OCR. Re: performance, my nothing-special Linux desktop machine can run the whole script over about 2-3 files per second.

Here's the implementation in a simple bash script:

#!/bin/bash
# Rotates a pbm file in place.

# Pass a .pbm as the only arg.
file=$1

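# Use a fresh temp directory per run so stale tables from an earlier
# (possibly aborted) run can't leak into this one.
TMP=$(mktemp -d /tmp/rotation-calc.XXXXXX)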

# Dependencies:                                                                 
# convert: apt-get install imagemagick                                          
# ocrad: sudo apt-get install ocrad                                               
ASPELL="/usr/bin/aspell"
AWK="/usr/bin/awk"
BASENAME="/usr/bin/basename"
CONVERT="/usr/bin/convert"
DIRNAME="/usr/bin/dirname"
HEAD="/usr/bin/head"
OCRAD="/usr/bin/ocrad"
SORT="/usr/bin/sort"
WC="/usr/bin/wc"

# Make copies in all four orientations (the src file is north; copy it to make 
# things less confusing)
file_name=$($BASENAME "$file")
north_file="$TMP/$file_name-north"
east_file="$TMP/$file_name-east"
south_file="$TMP/$file_name-south"
west_file="$TMP/$file_name-west"

cp "$file" "$north_file"
$CONVERT "$file" -rotate 90 "$east_file"
$CONVERT "$file" -rotate 180 "$south_file"
$CONVERT "$file" -rotate 270 "$west_file"

# OCR each (just append ".txt" to the path/name of the image)
north_text="$north_file.txt"
east_text="$east_file.txt"
south_text="$south_file.txt"
west_text="$west_file.txt"

$OCRAD -f -F utf8 $north_file -o $north_text
$OCRAD -f -F utf8 $east_file -o $east_text
$OCRAD -f -F utf8 $south_file -o $south_text
$OCRAD -f -F utf8 $west_file -o $west_text

# Get the word count for each txt file (least 'words' == least whitespace junk
# resulting from vertical lines of text that should be horizontal.)
wc_table="$TMP/wc_table"
echo "$($WC -w $north_text) $north_file" > $wc_table
echo "$($WC -w $east_text) $east_file" >> $wc_table
echo "$($WC -w $south_text) $south_file" >> $wc_table
echo "$($WC -w $west_text) $west_file" >> $wc_table

# Take the bottom two; these are likely right side up and upside down, but 
# generally too close to call beyond that.
bottom_two_wc_table="$TMP/bottom_two_wc_table"
$SORT -n $wc_table | $HEAD -2 > $bottom_two_wc_table

# Spellcheck. The lowest number of misspelled words is most likely the 
# correct orientation.
misspelled_words_table="$TMP/misspelled_words_table"
while read record; do
    txt=$(echo $record | $AWK '{ print $2 }')
    misspelled_word_count=$($ASPELL -l en list < "$txt" | $WC -w)
    echo "$misspelled_word_count $record" >> $misspelled_words_table
done < $bottom_two_wc_table

# Do the sort and overwrite the input file with the winning rotation.
winner=$($SORT -n $misspelled_words_table | $HEAD -1)
rotated_file=$(echo $winner | $AWK '{ print $4 }')

mv $rotated_file $file

# Clean up.
if [ -d $TMP ]; then
    rm -r $TMP
fi
