How to open a PDF raw?
I've been wanting to see the insides of a PDF for a while, that is, its raw source code. Is there any way of doing that?
Looking at the raw code of a PDF will not help you much unless you also have an idea of its internal structure. You should get yourself a copy of the official PDF reference, and you should read an introductory article or two to begin with.
Even after such preparation, you will not discover much that is useful just by staring at the raw code, because PDFs usually contain parts that are "filtered" (that is, compressed).
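To make "filtered" concrete, here is a minimal Python sketch that pulls stream bodies out of raw PDF bytes and tries to inflate them with zlib, which is what the common /FlateDecode filter uses. The sample fragment and the function name are made up for illustration. The regular expression is a shortcut that ignores /Length, predictors and other filters, so treat it as a way to peek, not as a parser.

```python
import re
import zlib

# Matches everything between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decompressed bodies of all streams that zlib can inflate."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes left after the data.
            results.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not Flate data (for example a JPEG image stream); skip it.
            continue
    return results


# Hypothetical fragment: one object whose stream body is Flate-compressed.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
packed = zlib.compress(content)
sample = (
    b"4 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(packed)
    + packed
    + b"\nendstream\nendobj\n"
)

for body in inflate_streams(sample):
    print(body.decode("latin-1"))
```

For a real file, read it with `open(path, "rb").read()` and pass the bytes to `inflate_streams`.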
How to look at the real PDF source behind the 'raw' binary parts
Jay Berkenbilt's qpdf is a very useful command-line tool (available for Linux, macOS and Windows, and as source code; originally released under the open source Artistic License, with newer releases under the Apache License 2.0). It can unpack most filtered content and reorganize the internal structure in a way that gives you much more insight into it (all objects are numerically ordered, etc.). This is what qpdf's QDF mode, enabled with the --qdf option, is for.
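One invocation that does this combines qpdf's `--qdf` and `--object-streams=disable` options. The sketch below wraps it in Python so it can be scripted; the file names and helper names are placeholders, and qpdf is assumed to be installed and on the PATH.

```python
import shutil
import subprocess


def qpdf_qdf_command(src, dst):
    """Build the qpdf argv that rewrites src as an uncompressed QDF file."""
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]


def expand_pdf(src, dst):
    """Run qpdf if it is installed; return its exit code, or None if missing."""
    if shutil.which("qpdf") is None:
        return None
    # qpdf may exit with a non-zero code for warnings, so the code is
    # returned to the caller instead of raising.
    return subprocess.run(qpdf_qdf_command(src, dst)).returncode


# Equivalent shell command:
#   qpdf --qdf --object-streams=disable orig.pdf uncompressed.pdf
print(" ".join(qpdf_qdf_command("orig.pdf", "uncompressed.pdf")))
```

The resulting file can then be opened in any text editor.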
Another useful and free tool (GPL licensed, but Linux-only as far as I know) for looking into PDFs is of course PDFEdit. This one even comes with a GUI (if you prefer that), while still allowing you access to the internal structure and the "raw" PDF code.
If the purpose is just to look into the file, then any simple text editor will do, e.g. Notepad. PDF is largely a text-based format that also includes embedded content byte streams.
What you see in a raw PDF are basic COS objects such as names, dictionaries, streams and so on. All object types are described in the PDF 32000 standard; see section 7.3, "Objects".
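As a rough illustration of those objects, the sketch below scans raw bytes for "N G obj ... endobj" pairs and lists what it finds. The sample bytes and the function name are invented for the example. A regular expression cannot handle every PDF (object streams, or "endobj" appearing inside binary data), so this is only a quick way to see the structure.

```python
import re

# "<number> <generation> obj", then the body up to the matching "endobj".
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def list_objects(pdf_bytes):
    """Return (object number, generation, body) for each indirect object found."""
    found = []
    for match in OBJ_RE.finditer(pdf_bytes):
        number = int(match.group(1))
        generation = int(match.group(2))
        found.append((number, generation, match.group(3).strip()))
    return found


# Hypothetical two-object fragment, written the way it appears in a text editor.
sample = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [] /Count 0 >>
endobj
"""

for number, generation, body in list_objects(sample):
    print(number, generation, body[:40].decode("latin-1"))
```

Here `<< ... >>` is a dictionary, `/Type` and `/Catalog` are names, and `2 0 R` is a reference to object 2.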
Use a hex editor. Of course, unless you know the PDF specification, you won't recognize much.
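If no hex editor is at hand, a few lines of Python give a comparable view. This is a minimal sketch; the function name and the sample bytes are illustrative only.

```python
def hexdump(data, width=16):
    """Format bytes as offset, hex values and printable ASCII, one row per line."""
    pad = width * 3
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        text_part = "".join(chr(byte) if 32 <= byte < 127 else "." for byte in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}} {text_part}")
    return "\n".join(lines)


# A PDF starts with a "%PDF-x.y" header, often followed by a comment line of
# high-bit bytes that marks the file as binary.
print(hexdump(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"))
```

To inspect a real file, pass it something like `open(path, "rb").read(256)`.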
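In addition to the qpdf tool, conversion to PostScript might be helpful.
PDF is closely related to PostScript (it shares the imaging model and many operators, although it is not strictly a subset). It is usually quite easy to figure out, for example, where the labels of a graph are. You can either use pdf2ps or invoke Ghostscript directly.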
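Both routes can be scripted. The sketch below only builds the argument lists. It assumes the Ghostscript executable is called `gs` (on Windows it is usually named differently, e.g. `gswin64c`) and that the `ps2write` output device is available in the installed version. The file names and function names are placeholders.

```python
import shutil
import subprocess


def pdf2ps_command(src, dst):
    """argv for the pdf2ps wrapper script that ships with Ghostscript."""
    return ["pdf2ps", src, dst]


def ghostscript_command(src, dst):
    """argv for calling Ghostscript directly with its PostScript output device."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=ps2write", f"-sOutputFile={dst}", src,
    ]


def convert_to_ps(src, dst):
    """Run Ghostscript if it is installed; return its exit code, or None."""
    if shutil.which("gs") is None:
        return None
    return subprocess.run(ghostscript_command(src, dst)).returncode


print(" ".join(ghostscript_command("figure.pdf", "figure.ps")))
```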
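When you generate your PDFs with pdflatex, you can disable compression with an option (for instance, by setting the pdfTeX primitives \pdfcompresslevel=0 and \pdfobjcompresslevel=0 in the preamble). This makes the PDF more readable.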
Some more recent observations on the other answers.
Adobe keeps moving its freely available copy of the 2008 standard, so currently it is here: https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
The Internet Archive currently has a copy here: https://ia601003.us.archive.org/5/items/pdf320002008/PDF32000_2008.pdf
Both should be the identical 22,491,828-byte file, so beware that neither includes any errata.
A PDF can be perfectly plain text (call it mime "text/pdf"), annotated, and generated from a console keyboard, from the command line (too slow) or from a batch file. I won't bore you with the whole file, but it starts like this:
Thus the annotated raw PDF (note that I had edited the order in the cmd file in preparation for an XMP data section, so it is not identical) could look like:
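As an illustration of the same idea (this is not the batch file described above), the Python sketch below writes a complete one-page PDF using nothing but text. The function name, page size, font and message are arbitrary choices for the example. The only fiddly part is the cross-reference table, which must record the byte offset of every object.

```python
def build_minimal_pdf(text="Hello, raw PDF"):
    """Return the bytes of a one-page PDF built purely from text.

    Assumes `text` is plain ASCII without parentheses or backslashes,
    which would need escaping inside a PDF string.
    """
    content = f"BT /F1 24 Tf 72 700 Td ({text}) Tj ET".encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


if __name__ == "__main__":
    with open("hello.pdf", "wb") as handle:
        handle.write(build_minimal_pdf())
```

Opening the resulting hello.pdf in a text editor shows every object in readable form, because nothing in it is compressed.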
Others have made many suggestions for decompressing a binary application/pdf into a textual PDF; some produce a hybrid that still contains binary data.
The three most common tools designed for the task are qpdf (already mentioned, but it produces a hybrid QDF), PDFtk (uncompress) and MuTool (different CLI options). MuTool is the one I play with most, as it is easy in the GL GUI to change the output settings. The output can be modified in MS Notepad while previewing the result.
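For reference, these are typical decompressing invocations of the three tools, collected in a small Python helper. The helper and file names are illustrative, each tool is assumed to be installed under its usual executable name, and each has further options beyond the ones shown.

```python
import shutil


def uncompress_command(tool, src, dst):
    """Return the argv that asks `tool` to write an uncompressed copy of src."""
    commands = {
        # QDF mode: uncompressed streams, objects renumbered in order.
        "qpdf": ["qpdf", "--qdf", "--object-streams=disable", src, dst],
        # PDFtk's "uncompress" output option.
        "pdftk": ["pdftk", src, "output", dst, "uncompress"],
        # "mutool clean" with -d decompresses all streams.
        "mutool": ["mutool", "clean", "-d", src, dst],
    }
    return commands[tool]


def installed_tools():
    """List which of the three tools can be found on PATH."""
    return [name for name in ("qpdf", "pdftk", "mutool") if shutil.which(name)]


for name in ("qpdf", "pdftk", "mutool"):
    print(" ".join(uncompress_command(name, "in.pdf", "out.pdf")))
```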
So any text editing script can write or edit a PDF, even one with graphics, and several applications can convert a raw "binary" PDF into a raw "textual" PDF. However, never attempt to edit a PDF while it is temporarily in a textual base64 RePrEx (possible, but totally impractical).