How can I convert the binary parts of a PDF to ASCII/ANSI so that I can view it in a text editor?
Most PDFs contain lots of binary-looking parts in between some ASCII. But I also remember having seen PDFs where such binary parts were by and large absent, and one could open them in a text editor to study their structure.
Is there a trick, tool, or command that will convert binary PDF parts to ASCII/ANSI? (Preferably "free as in beer" or even "free as in liberty")
[Updated 2014-10-15]
Using Ghostscript
Ghostscript has a small utility program, written in PostScript, in its source code repository. It's called pdfinflt.ps. If you are lucky, it may already slumber in a 'toolbin' subdirectory of your Ghostscript installation location. Otherwise, get it from that repository. Then run it, together with your target input PDF, through the Ghostscript interpreter.
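A minimal sketch of that invocation, assuming the gs executable is on your PATH. The helper name inflate_with_gs and all file names are hypothetical. The -- switch makes Ghostscript treat the next argument as the program to run and hand everything after it to that program as arguments.

```shell
# Sketch only: inflate_with_gs is a made-up helper name, not part of Ghostscript.
# $1 = path to pdfinflt.ps, $2 = input PDF, $3 = output PDF
inflate_with_gs() {
    # "--" runs the PostScript file in $1 and passes it the remaining arguments.
    gs -- "$1" "$2" "$3"
}

# Example call (paths are placeholders):
# inflate_with_gs /path/to/pdfinflt.ps input.pdf inflated-output.pdf
```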
pdfinflt.ps will (try to) expand all 'streams' contained in the PDF which use the following compression filters/methods: /FlateDecode, /LZWDecode, /ASCII85Decode, /ASCIIHexDecode. It will not attempt to remove /RunLengthDecode, /CCITTFaxDecode, /DCTDecode, /JBIG2Decode and /JPXDecode. (Compressed/binary fonts will also pass unchanged into the output PDF.)
If you are in an adventurous mood, you may dare to uncomment those lines in the utility which disable /RunLengthDecode, /DCTDecode and /CCITTFaxDecode and see if it still works...
Using qpdf
Another useful tool to transform a PDF into an internal format that enables text editor access is qpdf. It is a "command-line program that does structural, content-preserving transformations on PDF files".
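Example usage, as a minimal sketch: the wrapper name expand_with_qpdf and the file names are hypothetical, while --qdf and --object-streams=disable are the two qpdf switches this answer relies on.

```shell
# Sketch only: expand_with_qpdf is a made-up helper name, not part of qpdf.
# $1 = input PDF, $2 = output PDF
expand_with_qpdf() {
    # --qdf: write the output in qpdf's editor-friendly QDF mode.
    # --object-streams=disable: write objects individually instead of
    #   packing them into compressed object streams.
    qpdf --qdf --object-streams=disable "$1" "$2"
}

# Example call (file names are placeholders):
# expand_with_qpdf orig.pdf expanded.pdf
```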
The output of the QDF mode enforced by the --qdf switch organizes and re-orders the objects neatly. It adds comments to track the original object IDs and page content streams. All object dictionaries are written into a "normalized" standard format for easier parsing.
The --object-streams=disable switch causes the extraction of (otherwise not recognizable) individual objects that are compressed into another object's stream data.
Using mutool
Artifex, the creators of Ghostscript, offer another tool under a Free and Open Source Software license: MuPDF.
MuPDF comes with a command-line tool, mutool, whose clean subcommand can also expand compressed PDF object streams. Its arguments, as used for this purpose:
clean: rewrites the PDF
-d: decompresses all streams
-a: ASCIIHex-encodes all binary streams
4,7,8,9: a trailing page list, here selecting pages 4, 7, 8 and 9 for inclusion in output.pdf
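Put together, the arguments listed above would form a command line along these lines. This is a sketch under assumptions: the wrapper name clean_with_mutool and the file names are hypothetical, and mutool is assumed to be on your PATH.

```shell
# Sketch only: clean_with_mutool is a made-up helper name, not part of MuPDF.
# $1 = input PDF, $2 = output PDF, $3 = optional page list such as 4,7,8,9
clean_with_mutool() {
    # -d decompresses all streams, -a ASCIIHex-encodes binary streams.
    # ${3:+"$3"} adds the page list only when one was given.
    mutool clean -d -a "$1" "$2" ${3:+"$3"}
}

# Example call (file names are placeholders):
# clean_with_mutool input.pdf output.pdf 4,7,8,9
```

Leaving out the page list keeps all pages in the output.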
Using pdftk
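A minimal sketch of a pdftk invocation that uncompresses streams, assuming pdftk is on your PATH. The wrapper name uncompress_with_pdftk and the file names are hypothetical.

```shell
# Sketch only: uncompress_with_pdftk is a made-up helper name, not part of pdftk.
# $1 = input PDF, $2 = output PDF
uncompress_with_pdftk() {
    # "output" names the result file; the trailing "uncompress" keyword
    # asks pdftk to write the streams uncompressed.
    pdftk "$1" output "$2" uncompress
}

# Example call (file names are placeholders):
# uncompress_with_pdftk your.pdf uncompressed-your.pdf
```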
pdftk is the last tool on this list: it, too, can uncompress a PDF's object streams. Note that uncompress is the final word on its command line.
Pick your favorite
All of the above tools are available for Linux, Mac OS X, Unix and Windows.
My own favorite for most practical cases is QPDF. However, you should make your own experiments and compare the (different) output of each of the suggested tools. Then make your own pick.
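One rough way to start such a comparison is to count how many lines in each output file still mention /FlateDecode. This is only a heuristic sketch: it counts lines rather than streams, the helper name count_flate_lines is hypothetical, and it assumes a grep that supports -a (as GNU and BSD grep do).

```shell
# Heuristic sketch: count lines that still mention the /FlateDecode filter.
# count_flate_lines is a made-up helper name.
count_flate_lines() {
    # -a: treat the (possibly binary) PDF as text.
    # -c: print the number of matching lines instead of the lines themselves.
    # grep exits non-zero when nothing matches; "|| true" hides that status.
    grep -a -c '/FlateDecode' "$1" || true
}

# Example: compare an original with an expanded copy (file names are placeholders).
# count_flate_lines orig.pdf
# count_flate_lines expanded.pdf
```

A lower count in the expanded file suggests that more streams were decompressed, but opening the file in a text editor remains the real check.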