Extracting the table of contents (TOC) from a PDF?
I am extracting a PDF into images/SWF and text with the help of SWFTools and XPDF. I run these tools from a script.
Now I want to go one step further and get the TOC from the PDF. Is it possible to extract this information?
I tried dumppdf.py -T, but it did not work on some PDF files. I just found another tool, mutool, which is part of MuPDF. I don't know whether it is better than dumppdf.py in general, but it worked on a PDF file for which dumppdf.py throws an error. Here's how to extract the TOC with mutool:
mutool show {your-pdf-file} outline
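If the mutool command above needs to be called from a script, one possible way is a thin wrapper like the sketch below. It assumes mutool is installed and on PATH; the function names are made up for illustration. The output lines are returned unparsed, because the exact line format can differ between MuPDF versions.

```python
import shutil
import subprocess


def build_outline_command(pdf_path, mutool="mutool"):
    """Return the argument list for `mutool show <pdf> outline`."""
    return [mutool, "show", pdf_path, "outline"]


def read_outline_lines(pdf_path, mutool="mutool"):
    """Run mutool and return its non-empty output lines, unparsed."""
    # Fail early with a clear message if the executable is missing.
    if shutil.which(mutool) is None:
        raise FileNotFoundError(f"{mutool!r} was not found on PATH")
    result = subprocess.run(
        build_outline_command(pdf_path, mutool),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if mutool exits with an error
    )
    return [line for line in result.stdout.splitlines() if line.strip()]
```

Calling read_outline_lines("book.pdf") would then give one string per outline entry, which can be printed or post-processed as needed.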
I found this with a little bit of searching. It looks rather promising.
PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/index.html
Note: The tool is Python-based, but you should be able to run it from the shell. Alternatively, you may be able to glean some useful information from the source code itself, since the project is open source.
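As a rough illustration of using PDFMiner as a library rather than from the shell, here is a minimal sketch. This is not taken from the PDFMiner site. It assumes the maintained pdfminer.six fork is installed; read_outline and format_outline are hypothetical helper names. Outline levels are expected to start at 1, which the formatter guards against in case they do not.

```python
def format_outline(entries, indent="  "):
    """Turn (level, title) pairs into indented text lines."""
    # max(..., 0) keeps the indent valid even if a level below 1 shows up.
    return [indent * max(level - 1, 0) + title for level, title in entries]


def read_outline(pdf_path):
    """Return (level, title) pairs from the PDF outline, or [] if it has none."""
    # Imported lazily so format_outline() works without pdfminer installed.
    from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
    from pdfminer.pdfparser import PDFParser

    with open(pdf_path, "rb") as fp:
        document = PDFDocument(PDFParser(fp))
        try:
            # Each item is (level, title, dest, action, structure_element).
            return [(level, title) for level, title, *_ in document.get_outlines()]
        except PDFNoOutlines:
            return []
```

Printing "\n".join(format_outline(read_outline("book.pdf"))) would show the outline as an indented list. A PDF without an embedded outline yields an empty list, which may explain why TOC extraction fails on some files.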
Alternatively, you can use MuPDF, a fairly lightweight but complete PDF implementation written in C. In the apps/ subdirectory you will find tools that can view, dump and extract information from PDF files. I prefer MuPDF over xpdf because it is actively maintained and has better PDF support.
Otherwise, there's always Poppler, which is based on xpdf; its developers ported the code to C++. Hence, it performs worse than its predecessor. Compared to MuPDF, Poppler seems to have slightly more features, but in return its code is much more complex.
For your purposes, though, MuPDF should be sufficient. You could hack together a simple application from the example code provided in apps/ that extracts all the information you need without relying on external applications.
I think looking at PHP's PDFLib would be a very good place to start. If you scroll down, you will see plenty of user-posted solutions for converting PDF to HTML or PDF to text. After conversion, a relatively simple match function could extract the tagged TOC items and put them into an array, for example, which you can then manipulate as you please.
This StackOverflow post also has some more solutions.
Hope this helps.
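The "match function" idea could look something like the sketch below, written in Python rather than PHP. It assumes the converted text keeps TOC lines of the form "1.2 Title ....... 15", with dot leaders before the page number; that shape is an assumption and will not hold for every PDF. TOC_LINE and extract_toc are hypothetical names.

```python
import re

# Assumed line shape: optional section number, title, dot leaders, page number.
TOC_LINE = re.compile(
    r"^\s*(?:(?P<number>\d+(?:\.\d+)*)\.?\s+)?"  # optional "1", "1.2", "3."
    r"(?P<title>\S.*?)\s*"                       # title text, as short as possible
    r"\.{2,}\s*(?P<page>\d+)\s*$"                # at least two dots, then the page
)


def extract_toc(text):
    """Collect TOC-looking lines from plain text into a list of dicts."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line)
        if match:
            entries.append({
                "number": match.group("number"),  # None when the line has no number
                "title": match.group("title"),
                "page": int(match.group("page")),
            })
    return entries


if __name__ == "__main__":
    sample = "Contents\n1 Introduction ........ 1\n1.1 Scope .... 3\n"
    for entry in extract_toc(sample):
        print(entry)
```

Lines without dot leaders are ignored, so headings and body text are skipped. A TOC typeset with plain spaces instead of dots would need a different pattern.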