Compare PDF content with Ruby
I am writing a Ruby script/app that helps me compile LaTeX to (at least) PDF. One feature I want is for it to run pdflatex iteratively until the PDF converges (as it should, I guess).
The idea is to compare the PDF generated in one iteration against the one from the previous iteration using their fingerprints. In particular, I currently use Digest::MD5.file(.).
The problem is that this never converges. A (hopefully the) culprit is the PDF's timestamp, which pdflatex sets with at least one-second resolution. Since a pdflatex run typically takes longer than one second, the result keeps changing. That is, I expect the PDFs to be equal up to the timestamp(s) after some point. This assumption might be wrong; hints appreciated.
What can I do about this? My basic ideas so far:
- Use a library capable of doing the job
- Strip meta data away and only hash PDF content
- Overwrite timestamps by a fixed value before comparing
Do you have more ideas or even solutions? Solutions should only use free software that runs on Linux. Solutions that use only Ruby are preferred, but using external software is perfectly acceptable.
By the way, I do not exactly know how PDF is encoded but I suspect that merely comparing the contained text won't work for me since only graphics or links might change in later iterations.
Possibly related:
- How to compare two PDF files? (Messy, text-based or proprietary solutions)
- Functional PDF Testing (Uses a Java library; not clear whether it is up to the job)
Comments (4)
This is probably not the most bullet-proof solution, but it works for me:
or from Ruby
This computes the PDF's hash after dropping the lines that cause supposedly identical PDFs to differ.
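The idea of hashing a PDF while skipping its volatile lines can be sketched in Ruby roughly as follows. This is a minimal sketch, not the answerer's original code: the method name `stable_pdf_digest` and the pattern list (`/CreationDate`, `/ModDate`, `/ID [`) are assumptions about which lines pdflatex changes between runs, and they only help if those entries sit on their own lines in the file.

```ruby
require "digest"

# Lines that commonly differ between otherwise identical pdflatex runs.
# ASSUMPTION: this pattern list is illustrative; adjust it for your PDF creator.
VOLATILE_LINE = %r{/CreationDate|/ModDate|/ID\s*\[}

# Returns an MD5 hex digest of the file at `path`, ignoring every line
# that matches VOLATILE_LINE. The file is read in binary mode because
# PDFs contain arbitrary bytes.
def stable_pdf_digest(path)
  digest = Digest::MD5.new
  File.open(path, "rb") do |file|
    file.each_line do |line|
      digest << line unless line.match?(VOLATILE_LINE)
    end
  end
  digest.hexdigest
end
```

Two runs can then be compared with `stable_pdf_digest("old.pdf") == stable_pdf_digest("new.pdf")` (file names hypothetical).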
YMMV, depending on your PDF creator. To find out what other lines you need to drop, use
[Disclaimer: I'm the author of Identikal]
For a project we had a requirement to compare two PDFs in pure Ruby. We ended up writing a gem called identikal. This gem compares two unencrypted PDF files and returns true if they are identical and false otherwise.
Once you install the gem you can compare two PDFs as shown below:
This isn't an answer to your question, but are you familiar with latexmk? It's a Perl script that does exactly what you're after, but achieves it in a very different way. It examines the .log and .aux files left behind by each tex run, and then applies heuristics to decide what needs to happen in each case (which may be more complicated than simply re-running tex: mkindex or xindy may need to be run as well).
You could either mimic its behavior (although with 3546 sloc, I don't particularly recommend it) or simply call it from your Ruby script/app.
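Calling latexmk from a Ruby script could look roughly like this. It is a minimal sketch: the helper names `latexmk_command` and `compile_with_latexmk` are made up for illustration, and it assumes latexmk is installed and on the PATH. The `-pdf` option asks latexmk to produce the PDF with pdflatex.

```ruby
# Builds the argument list for latexmk. Kept separate so the command
# can be inspected or logged without running it.
def latexmk_command(tex_file)
  ["latexmk", "-pdf", tex_file]
end

# Runs latexmk on `tex_file`. Returns true if it exited with status 0,
# false if it failed or could not be started (system returns nil then).
def compile_with_latexmk(tex_file)
  system(*latexmk_command(tex_file)) == true
end
```

Passing the arguments as a list rather than one string avoids shell quoting problems with file names that contain spaces.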
Since a latex run does not have access to its previous runs, and depends only (besides system parameters such as the current time) on the text files involved (such as tex, aux, bib, ...), the resulting pdf file converges once all those text files converge (disregarding the dependency on system parameters such as time).
In short, you should check the convergence of the text files (tex, aux, bib, ...) rather than the convergence of the pdf file.
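Checking the auxiliary files instead of the PDF might be sketched as below. The names `aux_fingerprint` and `run_until_stable`, the default extension list, and the run limit are all assumptions for illustration; the block passed to `run_until_stable` stands for one compile run.

```ruby
require "digest"

# Combined MD5 fingerprint of the auxiliary files in `dir`.
# ASSUMPTION: the extension list is illustrative; extend it for your toolchain.
def aux_fingerprint(dir, extensions = %w[aux toc bbl idx out])
  files = Dir.glob(File.join(dir, "*.{#{extensions.join(',')}}")).sort
  digest = Digest::MD5.new
  files.each do |file|
    # Include the file name so that renamed or swapped files are noticed.
    digest << File.basename(file) << "\0" << File.binread(file) << "\0"
  end
  digest.hexdigest
end

# Calls the block (one compile run) until the fingerprint is the same
# before and after a run, or until max_runs is reached.
# Returns the number of runs performed; a return value equal to max_runs
# may mean the files never settled.
def run_until_stable(dir, max_runs: 5)
  previous = aux_fingerprint(dir)
  1.upto(max_runs) do |run|
    yield
    current = aux_fingerprint(dir)
    return run if current == previous
    previous = current
  end
  max_runs
end

# Example use (file name hypothetical):
#   run_until_stable(".") { system("pdflatex", "-interaction=nonstopmode", "paper.tex") }
```

Because the last run only confirms that nothing changed, this approach always performs one run more than strictly needed to produce the final PDF.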