Comparison of two PDF files
I need to compare the contents of two almost identical files and highlight the differing portions in the corresponding PDF file. I am using PDFBox. Please help me at least with the logic.
Comments (5)
If you prefer a tool with a GUI, you could try diffpdf. It's by Mark Summerfield, and since it's written with Qt, it should be available (or buildable) on all platforms where Qt runs.
You can do the same thing with a shell script on Linux. The script wraps 3 components:
ImageMagick's compare command
the pdftk utility
Ghostscript
It's rather easy to translate this into a .bat batch file for DOS/Windows. Here are the building blocks:
pdftk
Use pdftk to split each multipage PDF file into multiple single-page PDFs.
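A minimal sketch of the split step, assuming pdftk's burst operation. The function name, directory, and file prefixes are hypothetical and only serve the illustration.

```shell
# Split a multipage PDF into single-page PDFs with pdftk's "burst" operation.
# Arguments (hypothetical names): input PDF, output directory, filename prefix.
split_pages() {
  local input="$1" outdir="$2" prefix="$3"
  mkdir -p "$outdir"
  # pdftk expands %03d into a zero-padded page number (001, 002, ...).
  pdftk "$input" burst output "$outdir/${prefix}_page_%03d.pdf"
}

# Example usage (assumes first.pdf and 2nd.pdf exist):
# split_pages first.pdf pages firstpdf
# split_pages 2nd.pdf   pages 2ndpdf
```

Zero-padded page numbers keep the pages in order when they are later matched by a shell glob.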
compare
Use compare to create a "diff" PDF page for each pair of corresponding pages.
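A sketch of the per-page comparison, assuming the single-page files follow a `<prefix>_page_NNN.pdf` naming scheme. That scheme, the function name, and the output names are assumptions, not something the answer specifies.

```shell
# Create one "diff" PDF per page pair with ImageMagick's compare.
# Arguments (hypothetical names): page directory, prefix of the first document,
# prefix of the second document, output directory.
diff_pages() {
  local pagedir="$1" first="$2" second="$3" outdir="$4"
  local a b num
  mkdir -p "$outdir"
  for a in "$pagedir/${first}"_page_*.pdf; do
    [ -e "$a" ] || continue        # the glob matched nothing
    num="${a##*_page_}"            # e.g. "001.pdf"
    b="$pagedir/${second}_page_${num}"
    [ -e "$b" ] || continue        # page exists only in the first document
    # compare exits with status 1 when the pages differ; this sketch
    # ignores the exit status entirely and just writes the diff page.
    compare "$a" "$b" -compose src "$outdir/diff_page_${num}" || true
  done
}

# Example usage:
# diff_pages pages firstpdf 2ndpdf diffs
```

Pages that exist in only one of the two documents are skipped here; a real script would probably want to report them.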
Note that compare is part of ImageMagick. For PDF processing it needs Ghostscript as a 'delegate', because it cannot process PDFs natively.
Once more, pdftk
Now you can concatenate your "diff" PDF pages again with pdftk.
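A sketch of the concatenation step using pdftk's cat operation. The `diff_page_*.pdf` naming and the function name are assumptions carried over for illustration.

```shell
# Join all single-page diff PDFs into one document with pdftk's "cat" operation.
# Arguments (hypothetical names): directory holding the diff pages, output file.
join_diff_pages() {
  local diffdir="$1" output="$2"
  # The shell expands the glob in sorted order, so zero-padded
  # page numbers keep the pages in sequence.
  pdftk "$diffdir"/diff_page_*.pdf cat output "$output"
}

# Example usage:
# join_diff_pages diffs diff_allpages.pdf
```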
Ghostscript
Ghostscript automatically inserts metadata (such as the current date and time) into its PDF output. Therefore, MD5-hash-based file comparisons do not work well on these PDFs.
If you want to automatically discover all cases that consist of purely white pages (that is, there are no visible differences in your input pages), you could also convert to a metadata-free bitmap format using the bmp256 output device. You can do that for the original PDFs (first.pdf and 2nd.pdf) or for the diff-PDF pages. Then create an all-white BMP page and its MD5 sum for reference, and compare the checksums.
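A sketch of the bitmap check, under several assumptions: pages are A4 (595x842 pixels at 72 dpi), `md5sum` from GNU coreutils is available, and all function and file names are hypothetical.

```shell
# Render a one-page PDF to a metadata-free 8-bit BMP (72 dpi, A4-sized canvas).
# Arguments: input PDF, output BMP.
page_to_bmp() {
  gs -q -o "$2" -r72 -g595x842 -sDEVICE=bmp256 "$1"
}

# Render an empty, all-white page with the same settings as a reference.
# Argument: output BMP.
white_reference_bmp() {
  gs -q -o "$1" -r72 -g595x842 -sDEVICE=bmp256 -c showpage
}

# Succeed when both files have the same MD5 checksum.
same_md5() {
  [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ]
}

# Example usage:
# white_reference_bmp reference-white-page.bmp
# page_to_bmp diffs/diff_page_001.pdf diff_page_001.bmp
# same_md5 diff_page_001.bmp reference-white-page.bmp && echo "page 001: no visible difference"
```

The checksums can only match if the reference page is rendered with the same resolution and canvas size as the page being tested.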
I had this very problem myself, and the quickest way that I've found is to use PHP and its bindings for ImageMagick (Imagick).
Of course, you need to install the ImageMagick bindings first.
I have come up with a jar that uses Apache PDFBox to compare PDF files. It can compare them pixel by pixel and highlight the differences.
Check my blog for an example and the download: http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/
The utility can be used:
To get page count
To get page content as plain text
To extract attached images from PDF
To store PDF pages as images
To compare PDF files in text mode (faster, but it does not compare the formatting, images, etc. in the PDF)
To compare PDF files in binary mode (slower; compares PDF documents pixel by pixel, highlights the differences, and stores the result as an image)
To compare PDFs on macOS Monterey (i.e. version 12), I was able to install diff-pdf using Homebrew and run it.
The --view option didn't work for me, but the --output-diff option did.
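A sketch of that workflow, assuming the Homebrew formula is named `diff-pdf`. The wrapper function and file names are hypothetical.

```shell
# One-time install via Homebrew (run manually):
#   brew install diff-pdf

# Write a PDF that visually highlights the differences between two PDFs.
# Arguments (hypothetical names): first PDF, second PDF, output diff PDF.
# diff-pdf exits with a non-zero status when the documents differ.
pdf_visual_diff() {
  local first="$1" second="$2" output="$3"
  diff-pdf --output-diff="$output" "$first" "$second"
}

# Example usage:
# pdf_visual_diff first.pdf 2nd.pdf diff.pdf
```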