Comparing PDF content with Ruby

Posted on 2024-10-14 13:10:20

I am writing a Ruby script/app that helps me compile LaTeX to (at least) PDF. One feature I want it to have is that it runs pdflatex iteratively until the PDF converges (as it should, I guess).

The idea is to compare the PDF generated in one iteration against the one from the former iteration using their fingerprints. In particular, I currently use Digest::MD5.file(.).

The problem is that this never converges. One culprit (hopefully the only one) is the PDF's timestamp, which pdflatex sets with at least one-second resolution. Since a pdflatex run typically takes longer than one second, the result keeps changing. That is, I expect the PDFs to be equal up to their timestamp(s) after some point. This assumption might be wrong; hints appreciated.

What can I do about this? My basic ideas so far:

  • Use a library capable of doing the job
  • Strip meta data away and only hash PDF content
  • Overwrite timestamps by a fixed value before comparing

Do you have more ideas or even solutions? Solutions should only use free software that runs on Linux. Solutions that use only Ruby are preferred, but using external software is perfectly acceptable.

By the way, I do not exactly know how PDF is encoded but I suspect that merely comparing the contained text won't work for me since only graphics or links might change in later iterations.

Comments (4)

余生一个溪 2024-10-21 13:10:21

This is probably not the most bullet-proof solution, but it works for me:

grep -av -e '^/CreationDate' -e '^/ModDate' -e '^/ID' file.pdf | md5sum

or from Ruby

`grep -av -e '^/CreationDate' -e '^/ModDate' -e '^/ID' file.pdf | md5sum`.chop!

This computes the PDF's hash after dropping the lines that cause supposedly identical PDFs to differ.

YMMV, depending on your PDF creator. To find out what other lines you need to drop, use

diff -a file-1.pdf file-2.pdf | less
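
If you would rather not shell out to grep and md5sum, the same idea can be sketched in pure Ruby. This is only a minimal sketch: stable_pdf_digest and VOLATILE_PREFIXES are names made up for illustration, and the prefix list simply copies the three patterns from the grep command above. The resulting digest is not guaranteed to equal the md5sum output (grep may append a trailing newline), so only compare digests produced by the same method.

```ruby
require "digest"

# Line prefixes assumed to vary between otherwise identical runs; these are
# the three patterns from the grep command. Extend the list as needed.
VOLATILE_PREFIXES = ["/CreationDate", "/ModDate", "/ID"].freeze

# Returns the MD5 hex digest of the file with the volatile lines dropped.
def stable_pdf_digest(path, prefixes = VOLATILE_PREFIXES)
  digest = Digest::MD5.new
  # Binary mode: a PDF contains arbitrary bytes, not valid UTF-8 text.
  File.open(path, "rb") do |file|
    file.each_line do |line|
      digest << line unless line.start_with?(*prefixes)
    end
  end
  digest.hexdigest
end
```

Two consecutive builds can then be compared with stable_pdf_digest("old.pdf") == stable_pdf_digest("new.pdf"), where the file names are placeholders. As with the grep version, this only helps if your PDF creator writes those entries at the start of a line.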
淡水深流 2024-10-21 13:10:21

[Disclaimer: I'm the author of Identikal]

For a project we had a requirement to compare two PDFs in pure Ruby, and we ended up writing a gem called identikal. This gem compares two unencrypted PDF files and returns true if they are identical and false otherwise.

Once you install the gem you can compare two PDFs as shown below:

$ identikal file_a.pdf file_b.pdf
true
痴梦一场 2024-10-21 13:10:21

This isn't an answer to your question, but are you familiar with latexmk? It's a Perl script that does exactly what you're after, but achieves it in a very different way. It examines the .log and .aux files left behind by each tex run and uses heuristics to decide what needs to happen in each case (which may be more complicated than simply re-running tex: makeindex or xindy may need to be run as well).

You could either mimic its approach (although at 3546 SLOC, I don't particularly recommend it) or simply call it from your Ruby script/app.
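
Calling it from Ruby could look roughly like the sketch below. It assumes latexmk is installed and on the PATH and uses its -pdf option to build a PDF. The method name compile_with_latexmk and the runner parameter are hypothetical; the latter exists only so the call can be stubbed in tests.

```ruby
# Runs latexmk in PDF mode on tex_file and returns true only if it exits
# with status 0. `runner` is a made-up injection point that defaults to
# Kernel#system; it is not part of latexmk or of any library.
def compile_with_latexmk(tex_file, runner: Kernel.method(:system))
  # Separate arguments bypass the shell, so odd file names are safe.
  # Kernel#system returns nil if the command could not be executed.
  runner.call("latexmk", "-pdf", tex_file) == true
end
```

For example, compile_with_latexmk("paper.tex") or abort("latexmk failed"), with paper.tex as a placeholder name.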

丶视觉 2024-10-21 13:10:21

Since a LaTeX run has no access to its previous runs and depends only on the text files involved (tex, aux, bib, ...), besides system parameters such as the current time, the resulting PDF file converges once all those text files converge (disregarding the dependency on system parameters such as time).

In short, you should check the convergence of the text files (tex, aux, bib, ...) rather than the convergence of the pdf file.

  1. Make directory A, where you run latex.
  2. Make directory B, where you keep a copy of the text files resulting from the previous latex run.
  3. Run latex within A.
  4. If the contents of all the files in B are the same as the contents of the corresponding files in A, then stop. Otherwise, copy all the text files generated in A (aux, bib, ...) to B, excluding the original tex file if you know that it didn't change. You can also exclude the log from the copy list. Then return to step 3.
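
The steps above can be sketched in Ruby as follows. Note one deliberate substitution: instead of copying files into a directory B, this sketch keeps MD5 digests of the previous run in memory, which serves the same purpose. The names snapshot, run_until_stable and WATCHED_EXTENSIONS are made up for illustration, and the extension list is an assumption you will likely need to adjust.

```ruby
require "digest"

# Extensions of the generated text files to watch. This list is an
# assumption; extend it to match what your document actually produces.
WATCHED_EXTENSIONS = %w[aux toc out].freeze

# Maps each watched file name in dir to the MD5 digest of its contents.
def snapshot(dir, extensions = WATCHED_EXTENSIONS)
  paths = extensions.flat_map { |ext| Dir.glob(File.join(dir, "*.#{ext}")) }
  paths.sort.to_h { |path| [File.basename(path), Digest::MD5.file(path).hexdigest] }
end

# Calls the block (one LaTeX run) until the watched files stop changing.
# Returns the number of runs performed, or nil if max_runs was not enough.
def run_until_stable(dir, max_runs: 5)
  previous = snapshot(dir)
  1.upto(max_runs) do |run|
    yield
    current = snapshot(dir)
    return run if current == previous
    previous = current
  end
  nil
end
```

A call might look like run_until_stable("A") { system("pdflatex", "-output-directory=A", "main.tex") }, where the directory and file names are placeholders. The max_runs cap is there because some documents never settle.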