Extracting the table of contents (TOC) of a PDF?

Posted on 2024-08-24 16:37:11

I am extracting a PDF into images/SWF and text with the help of SWFTools and XPDF. I am running these tools from a script.

Now I want to go one step further and get the TOC from the PDF. Is it possible to extract this information?

Comments (4)

云朵有点甜 2024-08-31 16:37:11

I tried dumppdf.py -T, but it did not work on some PDF files.

There is another tool, mutool, which is part of MuPDF and which I just found. I don't know whether it is better than dumppdf.py in general, but it worked on a PDF file for which dumppdf.py throws an error.

Here's how to extract the TOC with mutool:

mutool show {your-pdf-file} outline
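
Since the question runs its tools from a script, the command above can be wrapped in a few lines. This is only a minimal Python sketch, assuming mutool is installed and on the PATH. The helper names outline_command and read_outline_lines are made up for illustration, and because the layout of the outline lines can differ between MuPDF versions, the sketch returns the raw lines instead of trying to parse them.

```python
import subprocess


def outline_command(pdf_path):
    """Build the argument list for `mutool show <file> outline`."""
    return ["mutool", "show", pdf_path, "outline"]


def read_outline_lines(pdf_path):
    """Run mutool and return its non-empty output lines, unparsed.

    Raises FileNotFoundError if mutool is not installed and
    subprocess.CalledProcessError if mutool exits with an error.
    """
    result = subprocess.run(
        outline_command(pdf_path),
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode output using the locale encoding
        check=True,           # turn a non-zero exit code into an exception
    )
    return [line for line in result.stdout.splitlines() if line.strip()]
```

Calling read_outline_lines("foo.pdf") should give one string per outline entry; a PDF without bookmarks would simply give an empty list.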


|煩躁 2024-08-31 16:37:11

I found this with a little bit of searching. It looks rather promising.

PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/index.html

Note: The tool is Python based, but you should be able to use the tool via shell access. Alternatively, you may be able to glean some useful info from the source code itself, as the project is open source.

From the Site:

dumppdf.py

dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents (such as images).

Examples:

$ dumppdf.py -a foo.pdf
(dump all the headers and contents, except stream objects)

$ dumppdf.py -T foo.pdf
(dump the table of contents)

$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)
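
If you would rather get the outline from Python directly than parse the dumppdf.py -T output, something like the following may work. It is a rough sketch, assuming the pdfminer.six package is installed and that PDFDocument.get_outlines() behaves as I understand it (yielding five-element tuples that start with level and title). The function names read_outline and format_outline are hypothetical.

```python
def format_outline(entries):
    """Render (level, title) pairs as indented lines, two spaces per level."""
    lines = []
    for level, title in entries:
        indent = "  " * max(level - 1, 0)  # top-level entries have level 1
        lines.append(f"{indent}{title}")
    return lines


def read_outline(pdf_path):
    """Return (level, title) pairs from the PDF outline, or [] if there is none.

    Needs the third-party pdfminer package; it is imported here so that
    format_outline can be used without it.
    """
    from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
    from pdfminer.pdfparser import PDFParser

    with open(pdf_path, "rb") as fp:
        document = PDFDocument(PDFParser(fp))
        try:
            # Each item is assumed to be (level, title, dest, action, se).
            return [
                (level, title)
                for level, title, _dest, _action, _se in document.get_outlines()
            ]
        except PDFNoOutlines:
            return []
```

Printing "\n".join(format_outline(read_outline("foo.pdf"))) would then show the TOC as an indented list. Like dumppdf.py -T, this only finds bookmarks stored in the PDF itself, so it will return nothing for files that have a printed contents page but no outline.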

開玄 2024-08-31 16:37:11

Alternatively, you can use MuPDF, a fairly lightweight but complete PDF implementation written in C. In the apps/ subdirectory you will find some tools that can view, dump and extract information from PDF files. I prefer MuPDF over xpdf because it is actively maintained and has better PDF support.

Otherwise, there's always Poppler, which is based on xpdf; its developers ported the code to C++. Hence, it performs worse than its predecessor. Compared to MuPDF, Poppler seems to have slightly more features, but in return its code is much more complex.

For your purposes, though, MuPDF should be sufficient. You could hack together a simple application from the example code provided in apps/ that extracts all the information you need without relying on external applications.

萌面超妹 2024-08-31 16:37:11

I think PHP's PDFLib would be a very good place to start. If you scroll down that page, you will see plenty of user-posted solutions for converting PDF to HTML or PDF to text. After conversion, a relatively simple match function could extract the tagged TOC items and put them into an array, for example, which you can then manipulate as you please.
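
To illustrate the "match function" idea, here is a small sketch in Python rather than PHP. It assumes the PDF has already been converted to plain text (for example by the text extraction step the question already uses) and that TOC lines look roughly like "1.2 Title ........ 15". That line shape, the pattern TOC_LINE and the function name extract_toc_entries are all assumptions for the example; real documents will need their own pattern.

```python
import re

# Assumed line shape: optional section number, a title, then dot leaders
# or a run of at least two spaces, then a page number at the end.
TOC_LINE = re.compile(
    r"^\s*(?P<number>\d+(?:\.\d+)*\.?)?\s*"
    r"(?P<title>.+?)"
    r"\s*(?:\.{2,}|\s{2,})\s*"
    r"(?P<page>\d+)\s*$"
)


def extract_toc_entries(text):
    """Return (section_number, title, page) tuples for lines that look like TOC rows."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line)
        if match:
            entries.append(
                (
                    match.group("number") or "",   # "" when there is no number
                    match.group("title").strip(),
                    int(match.group("page")),
                )
            )
    return entries
```

This is a heuristic: any body line that happens to end in a number after wide spacing would also match, so it is worth restricting the search to the pages where the contents listing actually appears.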

This StackOverflow post also has some more solutions.

Hope this helps.
