
Extract the table of contents of a PDF?

Posted 2024-08-24 16:37:11

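I am extracting PDFs to images/SWF and text with the help of SWFTools and XPDF. I run these tools from a script.

Now I would like to go a step further and get the table of contents (TOC) out of the PDF. Is it possible to extract this information?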


云朵有点甜 2024-08-31 16:37:11

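I tried dumppdf.py -T, but it does not work on some PDF files.

MuPDF ships another tool called mutool, which I only just discovered. I don't know whether it is better than dumppdf.py in general, but it handles PDF files on which dumppdf.py raises an error.

Here is how to use mutool:

$ mutool show {your-pdf-file} outline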
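
Since the question mentions running the tools from a script, here is a minimal Python sketch of calling mutool that way. It assumes mutool is on the PATH and that the subcommand is spelled `outline` in lowercase; the function names are made up for illustration. The exact layout of the output lines seems to differ between MuPDF versions, so the sketch returns them unparsed.

```python
import subprocess


def build_outline_command(pdf_path, mutool="mutool"):
    """Return the argument list for `mutool show <pdf> outline`."""
    return [mutool, "show", pdf_path, "outline"]


def read_outline_with_mutool(pdf_path, mutool="mutool"):
    """Run mutool and return the non-empty lines of its outline output.

    Raises FileNotFoundError if the mutool binary cannot be found and
    subprocess.CalledProcessError if mutool exits with a non-zero status.
    """
    result = subprocess.run(
        build_outline_command(pdf_path, mutool),
        capture_output=True,   # collect stdout and stderr instead of printing
        encoding="utf-8",      # assumption: mutool writes UTF-8 text
        errors="replace",      # do not fail on an undecodable title
        check=True,            # turn a non-zero exit status into an exception
    )
    return [line for line in result.stdout.splitlines() if line.strip()]
```

Calling `read_outline_with_mutool("foo.pdf")` should give one string per outline entry. A PDF without bookmarks may simply produce an empty list.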



|煩躁 2024-08-31 16:37:11

I found this with a little bit of searching. It looks rather promising.

PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/index.html

Note: The tool is Python based, but you should be able to use the tool via shell access. Alternatively, you may be able to glean some useful info from the source code itself, as the project is open source.

From the Site:

dumppdf.py
dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents (such as images).
Examples:
$ dumppdf.py -a foo.pdf
(dump all the headers and contents, except stream objects)

$ dumppdf.py -T foo.pdf
(dump the table of contents)

$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)
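
If you would rather read the table of contents from Python than parse the output of dumppdf.py, the sketch below may help. It assumes pdfminer.six (the maintained fork of PDFMiner) is installed; in that fork `PDFDocument.get_outlines()` yields `(level, title, dest, action, se)` tuples and raises `PDFNoOutlines` when the file has no outline. The two function names are hypothetical, and older PDFMiner releases lay out their modules differently.

```python
def format_outline(entries):
    """Turn (level, title) pairs into text lines indented by nesting level."""
    # Top-level entries are expected at level 1; max() guards against level 0.
    return ["  " * max(level - 1, 0) + title for level, title in entries]


def read_outline_with_pdfminer(pdf_path):
    """Return (level, title) pairs from a PDF outline, or [] if it has none."""
    # Imported here so format_outline() works without pdfminer.six installed.
    from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
    from pdfminer.pdfparser import PDFParser

    with open(pdf_path, "rb") as handle:
        document = PDFDocument(PDFParser(handle))
        try:
            return [
                (level, title)
                for level, title, _dest, _action, _se in document.get_outlines()
            ]
        except PDFNoOutlines:
            # The PDF has no bookmarks, so there is nothing to extract.
            return []
```

For example, `print("\n".join(format_outline(read_outline_with_pdfminer("foo.pdf"))))` prints one indented line per bookmark. Note that this only finds the outline stored in the PDF; a contents page that exists only as printed text is not covered.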
