Comparing two PDF files

Posted 2024-11-23 17:28:58

I need to compare the contents of two nearly identical PDF files and highlight the portions that differ in the corresponding PDF. I am using pdfbox. Please help me with at least the logic.

Comments (5)

如日中天 2024-11-30 17:28:58

If you prefer a tool with a GUI, you could try diffpdf. It's by Mark Summerfield, and since it's written with Qt, it should be available (or at least buildable) on every platform where Qt runs.

撑一把青伞 2024-11-30 17:28:58

You can do the same thing with a shell script on Linux. The script wraps 3 components:

  1. ImageMagick's compare command
  2. the pdftk utility
  3. Ghostscript

It's rather easy to translate this into a .bat Batch file for DOS/Windows...

Here are the building blocks:

pdftk

Use this command to split multi-page PDF files into single-page PDFs:

pdftk  first.pdf  burst  output  somewhere/firstpdf_page_%03d.pdf
pdftk  2nd.pdf    burst  output  somewhere/2ndpdf_page_%03d.pdf

compare

Use this command to create a "diff" PDF page for each of the pages:

compare \
       -verbose \
       -debug coder -log "%u %m:%l %e" \
        somewhere/firstpdf_page_001.pdf \
        somewhere/2ndpdf_page_001.pdf \
       -compose src \
        somewhereelse/diff_page_001.pdf

Note that compare is part of ImageMagick. For PDF input it needs Ghostscript as a 'delegate', because it cannot process PDF natively.

Once more, pdftk

Now you can again concatenate your "diff" PDF pages with pdftk:

pdftk \
      somewhereelse/diff_page_*.pdf \
      cat \
      output somewhereelse/diff_allpages.pdf

Ghostscript

Ghostscript automatically inserts metadata (such as the current date and time) into its PDF output. As a result, MD5-hash-based comparisons of the PDF files themselves do not work well.

If you want to automatically detect every diff page that is purely white (meaning there are no visible differences between the input pages), you can convert to a metadata-free bitmap format using the bmp256 output device. You can do that for the original PDFs (first.pdf and 2nd.pdf) or for the diff PDF pages:

 gs \
   -o diff_page_001.bmp \
   -r72 \
   -g595x842 \
   -sDEVICE=bmp256 \
    diff_page_001.pdf

 md5sum diff_page_001.bmp

For reference, create an all-white BMP page with the same settings and compute its MD5 sum like this. A diff page whose MD5 sum matches this value is entirely white:

 gs \
   -o reference-white-page.bmp \
   -r72 \
   -g595x842 \
   -sDEVICE=bmp256 \
   -c "showpage quit"

 md5sum reference-white-page.bmp
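
For readers who want to run the same check from Java (the question mentions pdfbox), here is a minimal sketch of the MD5 comparison. It uses only the JDK; the class name Md5Check and its methods are made up for illustration and are not part of any tool mentioned above.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: compares byte contents via their MD5 digests. */
final class Md5Check {

    /** Returns the MD5 digest of the data as a lowercase hex string. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** True if both byte arrays have the same MD5 digest. */
    static boolean sameContent(byte[] a, byte[] b) {
        return md5Hex(a).equals(md5Hex(b));
    }
}
```

Assuming the two BMP files from the commands above exist, their bytes can be loaded with Files.readAllBytes(Paths.get("diff_page_001.bmp")) and Files.readAllBytes(Paths.get("reference-white-page.bmp")) and passed to Md5Check.sameContent; a result of true would indicate an all-white diff page.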
つ低調成傷 2024-11-30 17:28:58

I had this very problem myself and the quickest way that I've found is to use PHP and its bindings for ImageMagick (Imagick).

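<?php
// Imagick rasterizes PDF input through Ghostscript, which must be installed.
$im1 = new \Imagick("file1.pdf");
$im2 = new \Imagick("file2.pdf");

// compareImages() returns an array: [0] is a difference image, [1] is the metric value.
// It compares a single image pair, so multi-page PDFs need a loop over their pages.
$result = $im1->compareImages($im2, \Imagick::METRIC_MEANSQUAREERROR);

if ($result[1] > 0.0) {
    // Files are DIFFERENT
} else {
    // Files are IDENTICAL
}

// Release the resources held by both objects.
$im1->destroy();
$im2->destroy();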

Of course, you need to install the ImageMagick bindings first:

sudo apt-get install php-imagick # Ubuntu/Debian (php5-imagick on older PHP 5 releases)
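
For intuition about what a mean-squared-error metric measures, here is a minimal Java sketch that computes it over two pages that have already been rasterized to images. The class and method names are hypothetical, and the normalization to the range 0 to 1 is an assumption; ImageMagick's exact scaling may differ. What matters for the check above is only that identical pages give 0.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: mean squared error between two equally sized images. */
final class MeanSquaredError {

    /** Returns 0.0 for identical images; larger values mean larger differences. */
    static double mse(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("images must have the same size");
        }
        double sum = 0.0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                int pa = a.getRGB(x, y);
                int pb = b.getRGB(x, y);
                // Compare the red, green and blue channels separately.
                for (int shift = 16; shift >= 0; shift -= 8) {
                    double d = (((pa >> shift) & 0xFF) - ((pb >> shift) & 0xFF)) / 255.0;
                    sum += d * d;
                }
            }
        }
        // Average over all pixels and the three color channels.
        return sum / (3.0 * a.getWidth() * a.getHeight());
    }
}
```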
陪你到最终 2024-11-30 17:28:58

I have built a jar on top of Apache PDFBox for comparing PDF files. It can compare them pixel by pixel and highlight the differences.

See my blog for examples and the download: http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/


To get page count

import com.taguru.utility.PDFUtil;

PDFUtil pdfUtil = new PDFUtil();
pdfUtil.getPageCount("c:/sample.pdf"); //returns the page count

To get page content as plain text

//returns the pdf content - all pages
pdfUtil.getText("c:/sample.pdf");

// returns the pdf content from page number 2
pdfUtil.getText("c:/sample.pdf",2);

// returns the pdf content from page number 5 to 8
pdfUtil.getText("c:/sample.pdf", 5, 8);

To extract attached images from PDF

// set the path where we need to store the images
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.extractImages("c:/sample.pdf");

// extracts & saves the images from page number 3
pdfUtil.extractImages("c:/sample.pdf", 3);

// extracts & saves the images from page 2 only (start page 2, end page 2)
pdfUtil.extractImages("c:/sample.pdf", 2, 2);

To store PDF pages as images

// set the path where we need to store the images
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.savePdfAsImage("c:/sample.pdf");

To compare PDF files in text mode (faster, but it does not compare formatting, images, etc. in the PDF)

String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";

// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesTextMode(file1, file2);

// compare the 3rd page alone
pdfUtil.comparePdfFilesTextMode(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.comparePdfFilesTextMode(file1, file2, 1, 5);
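
How comparePdfFilesTextMode works internally is not shown here. As a rough sketch of the underlying idea, assuming the text of each PDF has already been extracted into a String (for example with pdfUtil.getText), a naive line-by-line comparison could look like this. TextDiff and differingLines are hypothetical names, and the positional comparison is a simplification: a single inserted line makes every following line count as different.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: naive positional comparison of two extracted texts. */
final class TextDiff {

    /** Returns the 1-based numbers of the lines that differ between the two texts. */
    static List<Integer> differingLines(String textA, String textB) {
        // \R matches any line break; the limit -1 keeps trailing empty lines.
        String[] a = textA.split("\\R", -1);
        String[] b = textB.split("\\R", -1);
        List<Integer> result = new ArrayList<>();
        int max = Math.max(a.length, b.length);
        for (int i = 0; i < max; i++) {
            // A line missing from the shorter text counts as a difference.
            String lineA = i < a.length ? a[i] : null;
            String lineB = i < b.length ? b[i] : null;
            if (lineA == null || !lineA.equals(lineB)) {
                result.add(i + 1);
            }
        }
        return result;
    }
}
```

An empty result means the two texts match line for line; the returned line numbers tell you where to look when highlighting.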

To compare PDF files in binary mode (slower: compares the PDF documents pixel by pixel, highlights the differences, and stores the result as an image)

String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";

// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesBinaryMode(file1, file2);

// compare the 3rd page alone
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 1, 5);

//if you need to store the result
pdfUtil.highlightPdfDifference(true);
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.comparePdfFilesBinaryMode(file1, file2);
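
Since the question asks for the logic, here is a minimal sketch of what a pixel-by-pixel comparison with highlighting might look like once each page has been rendered to a BufferedImage (the rendering step, for instance with PDFBox, is not shown). This is not the PDFUtil implementation; PixelDiff, markDifferences and the red highlight color are assumptions for illustration.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: marks the pixels that differ between two rendered pages. */
final class PixelDiff {

    /** Opaque red, used to paint differing pixels. */
    static final int HIGHLIGHT = 0xFFFF0000;

    /**
     * Copies page b into out, painting every pixel that differs from page a
     * in the highlight color. Returns the number of differing pixels.
     */
    static long markDifferences(BufferedImage a, BufferedImage b, BufferedImage out) {
        int w = a.getWidth();
        int h = a.getHeight();
        if (b.getWidth() != w || b.getHeight() != h
                || out.getWidth() != w || out.getHeight() != h) {
            throw new IllegalArgumentException("all images must have the same size");
        }
        long differing = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int pixelB = b.getRGB(x, y);
                if (a.getRGB(x, y) != pixelB) {
                    out.setRGB(x, y, HIGHLIGHT);
                    differing++;
                } else {
                    out.setRGB(x, y, pixelB);
                }
            }
        }
        return differing;
    }
}
```

A return value of 0 means the two pages render identically; otherwise the out image can be saved (for example with ImageIO.write) as the highlighted result for that page.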
愛放△進行李 2024-11-30 17:28:58

To compare PDFs on macOS Monterey (i.e. version 12), I was able to install diff-pdf using Homebrew and run it.

The --view option didn't work for me, but the --output-diff did.
