当前位置：文江博客话题详情

两个pdf文件的比较

发布于 2024-11-23 17:28:58 字数 70 浏览 7 评论 0原文

我需要比较两个几乎相似的文件的内容，并突出显示相应 pdf 文件中的不同部分。我正在使用 pdfbox。请至少帮助我理解逻辑。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

如日中天 2024-11-30 17:28:58

如果您更喜欢带有 GUI 的工具，可以尝试这个：diffpdf 。它是由 Mark Summerfield 编写的，由于它是用 Qt 编写的，因此应该可以在（或者应该可以构建） Qt 运行的所有平台。

这是屏幕截图：在此处输入图像描述

回复收藏 0 原文

撑一把青伞 2024-11-30 17:28:58

您可以在 Linux 上使用 shell 脚本执行相同的操作。该脚本包含 3 个组件：

ImageMagick 的 compare 命令
pdftk 实用程序
Ghostscript

将其转换为 DOS/Windows 的 .bat 批处理文件相当容易...

以下是构建块：

pdftk

使用此命令将多页 PDF 文件拆分为多个单页 PDF：

pdftk  first.pdf  burst  output  somewhere/firstpdf_page_%03d.pdf
pdftk  2nd.pdf    burst  output  somewhere/2ndpdf_page_%03d.pdf

比较

使用此命令为每个页面创建一个“差异”PDF 页面：

compare \
       -verbose \
       -debug coder -log "%u %m:%l %e" \
        somewhere/firstpdf_page_001.pdf \
        somewhere/2ndpdf_page_001.pdf \
       -compose src \
        somewhereelse/diff_page_001.pdf

请注意， compare 是 ImageMagick 的一部分。但对于 PDF 处理，它需要 Ghostscript 作为“代理”，因为它本身无法执行此操作。

再次，pdftk

现在您可以再次使用 pdftk 连接您的“差异”PDF 页面：

pdftk \
      somewhereelse/diff_page_*.pdf \
      cat \
      output somewhereelse/diff_allpages.pdf

Ghostscript

Ghostscript 会自动将元数据（例如当前日期+时间）插入到其 PDF 输出中。因此，这对于基于 MD5 哈希的文件比较效果不佳。

如果您想自动发现由纯白色页面组成的所有情况（这意味着：输入页面中没有明显的差异），您还可以使用 bmp256 输出设备。您可以对原始 PDF（first.pdf 和 2nd.pdf）或 diff-PDF 页面执行此操作：

 gs \
   -o diff_page_001.bmp \
   -r72 \
   -g595x842 \
   -sDEVICE=bmp256 \
    diff_page_001.pdf

 md5sum diff_page_001.bmp

只需创建一个全白 BMP 页面及其 MD5sum（仅供参考），如下所示：

 gs \
   -o reference-white-page.bmp \
   -r72 \
   -g595x842 \
   -sDEVICE=bmp256 \
   -c "showpage quit"

 md5sum reference-white-page.bmp

You can do the same thing with a shell script on Linux. The script wraps 3 components:

ImageMagick's compare command
the pdftk utility
Ghostscript

It's rather easy to translate this into a .bat Batch file for DOS/Windows...

Here are the building blocks:

pdftk

Use this command to split multipage PDF files into multiple singlepage PDFs:

pdftk  first.pdf  burst  output  somewhere/firstpdf_page_%03d.pdf
pdftk  2nd.pdf    burst  output  somewhere/2ndpdf_page_%03d.pdf

compare

Use this command to create a "diff" PDF page for each of the pages:

compare \
       -verbose \
       -debug coder -log "%u %m:%l %e" \
        somewhere/firstpdf_page_001.pdf \
        somewhere/2ndpdf_page_001.pdf \
       -compose src \
        somewhereelse/diff_page_001.pdf

Note, that compare is part of ImageMagick. But for PDF processing it needs Ghostscript as a 'delegate', because it cannot do so natively itself.

Once more, pdftk

Now you can again concatenate your "diff" PDF pages with pdftk:

pdftk \
      somewhereelse/diff_page_*.pdf \
      cat \
      output somewhereelse/diff_allpages.pdf

Ghostscript

Ghostscript automatically inserts meta data (such as the current date+time) into its PDF output. Therefore this is not working well for MD5hash-based file comparisons.

If you want to automatically discover all cases which consist of purely white pages (that means: there are no visible differences in your input pages), you could also convert to a meta-data free bitmap format using the bmp256 output device. You can do that for the original PDFs (first.pdf and 2nd.pdf), or for the diff-PDF pages:

 gs \
   -o diff_page_001.bmp \
   -r72 \
   -g595x842 \
   -sDEVICE=bmp256 \
    diff_page_001.pdf

 md5sum diff_page_001.bmp

Just create an all-white BMP page with its MD5sum (for reference) like this:

 gs \
   -o reference-white-page.bmp \
   -r72 \
   -g595x842 \
   -sDEVICE=bmp256 \
   -c "showpage quit"

 md5sum reference-white-page.bmp

回复收藏 0 原文

つ低調成傷 2024-11-30 17:28:58

我自己也遇到了这个问题，我发现最快的方法是使用 PHP 及其对 ImageMagick (Imagick) 的绑定。

<?php
$im1 = new \Imagick("file1.pdf");
$im2 = new \Imagick("file2.pdf");

$result = $im1->compareImages($im2, \Imagick::METRIC_MEANSQUAREERROR);

if($result[1] > 0.0){
    // Files are DIFFERENT
}
else{
    // Files are IDENTICAL
}

$im1->destroy();
$im2->destroy();

当然，您需要先安装 ImageMagick 绑定：

sudo apt-get install php5-imagick # Ubuntu/Debian

I had this very problem myself and the quickest way that I've found is to use PHP and its bindings for ImageMagick (Imagick).

<?php
$im1 = new \Imagick("file1.pdf");
$im2 = new \Imagick("file2.pdf");

$result = $im1->compareImages($im2, \Imagick::METRIC_MEANSQUAREERROR);

if($result[1] > 0.0){
    // Files are DIFFERENT
}
else{
    // Files are IDENTICAL
}

$im1->destroy();
$im2->destroy();

Of course, you need to install the ImageMagick bindings first:

sudo apt-get install php5-imagick # Ubuntu/Debian

回复收藏 0 原文

陪你到最终 2024-11-30 17:28:58

我想出了一个使用 apache pdfbox 来比较 pdf 文件的 jar - 这可以逐像素比较 &突出差异。

检查我的博客：http://www.testautomationguru。 com/introducing-pdfutil-to-compare-pdf-files-extract-resources/ 例如 &下载。

获取页数

import com.taguru.utility.PDFUtil;

PDFUtil pdfUtil = new PDFUtil();
pdfUtil.getPageCount("c:/sample.pdf"); //returns the page count

以纯文本形式获取页面内容

//returns the pdf content - all pages
pdfUtil.getText("c:/sample.pdf");

// returns the pdf content from page number 2
pdfUtil.getText("c:/sample.pdf",2);

// returns the pdf content from page number 5 to 8
pdfUtil.getText("c:/sample.pdf", 5, 8);

从 PDF 中提取附加图像

//set the path where we need to store the images
 pdfUtil.setImageDestinationPath("c:/imgpath");
 pdfUtil.extractImages("c:/sample.pdf");

// extracts & saves the pdf content from page number 3
pdfUtil.extractImages("c:/sample.pdf", 3);

// extracts & saves the pdf content from page 2
pdfUtil.extractImages("c:/sample.pdf", 2, 2);

将 PDF 页面存储为图像

//set the path where we need to store the images
 pdfUtil.setImageDestinationPath("c:/imgpath");
 pdfUtil.savePdfAsImage("c:/sample.pdf");

以文本模式比较 PDF 文件（更快 - 但不比较 PDF 中的格式、图像等）

String file1="c:/files/doc1.pdf";
String file1="c:/files/doc2.pdf";

// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesTextMode(file1, file2);

// compare the 3rd page alone
pdfUtil.comparePdfFilesTextMode(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.comparePdfFilesTextMode(file1, file2, 1, 5);

比较 PDF二进制模式下的文件（较慢 - 逐像素比较 PDF 文档 - 突出显示 pdf 差异并将结果存储为图像）

String file1="c:/files/doc1.pdf";
String file1="c:/files/doc2.pdf";

// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesBinaryMode(file1, file2);

// compare the 3rd page alone
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 1, 5);

//if you need to store the result
pdfUtil.highlightPdfDifference(true);
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.comparePdfFilesBinaryMode(file1, file2);

I have come up with a jar using apache pdfbox to compare pdf files - this can compare pixel by pixel & highlight the differences.

Check my blog : http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/ for example & download.

To get page count

import com.taguru.utility.PDFUtil;

PDFUtil pdfUtil = new PDFUtil();
pdfUtil.getPageCount("c:/sample.pdf"); //returns the page count

To get page content as plain text

//returns the pdf content - all pages
pdfUtil.getText("c:/sample.pdf");

// returns the pdf content from page number 2
pdfUtil.getText("c:/sample.pdf",2);

// returns the pdf content from page number 5 to 8
pdfUtil.getText("c:/sample.pdf", 5, 8);

To extract attached images from PDF

//set the path where we need to store the images
 pdfUtil.setImageDestinationPath("c:/imgpath");
 pdfUtil.extractImages("c:/sample.pdf");

// extracts & saves the pdf content from page number 3
pdfUtil.extractImages("c:/sample.pdf", 3);

// extracts & saves the pdf content from page 2
pdfUtil.extractImages("c:/sample.pdf", 2, 2);

To store PDF pages as images

//set the path where we need to store the images
 pdfUtil.setImageDestinationPath("c:/imgpath");
 pdfUtil.savePdfAsImage("c:/sample.pdf");

To compare PDF files in text mode (faster – But it does not compare the format, images etc in the PDF)

String file1="c:/files/doc1.pdf";
String file1="c:/files/doc2.pdf";

// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesTextMode(file1, file2);

// compare the 3rd page alone
pdfUtil.comparePdfFilesTextMode(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.comparePdfFilesTextMode(file1, file2, 1, 5);

To compare PDF files in Binary mode (slower – compares PDF documents pixel by pixel – highlights pdf difference & store the result as image)

String file1="c:/files/doc1.pdf";
String file1="c:/files/doc2.pdf";

// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesBinaryMode(file1, file2);

// compare the 3rd page alone
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 1, 5);

//if you need to store the result
pdfUtil.highlightPdfDifference(true);
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.comparePdfFilesBinaryMode(file1, file2);

回复收藏 0 原文