How to extract formatted text content from a PDF
How can I extract the text content (not images) from a PDF while roughly preserving its style and layout, the way Google Docs does?
Comments (6)
To extract the text from a PDF and get its position, you can use PDFMiner. PDFMiner can also export the PDF directly as HTML, keeping the text in the correct position.
I don't know your use case, but you can run into a lot of problems doing this, because PDF is presentation-oriented rather than content-oriented: the text flow is not continuous. So if you want the text to be editable, it will not be an easy task.
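As a rough illustration of the PDFMiner approach, here is a minimal sketch that assumes the pdfminer.six fork is installed. The `TextBox` type and the `iter_text_boxes` and `reading_order` helpers are hypothetical names made up for this example, not part of PDFMiner. Because the text flow in a PDF is not continuous, the sketch sorts boxes by position to approximate reading order, which will not be right for multi-column pages.

```python
from typing import Iterable, Iterator, List, NamedTuple


class TextBox(NamedTuple):
    """One block of text and its bounding box in PDF points (origin bottom-left)."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def iter_text_boxes(pdf_path: str) -> Iterator[TextBox]:
    """Yield every text container PDFMiner finds, together with its position."""
    # Imported here so reading_order() below is usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_no, layout in enumerate(extract_pages(pdf_path), start=1):
        for element in layout:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox
                yield TextBox(page_no, x0, y0, x1, y1, element.get_text().strip())


def reading_order(boxes: Iterable[TextBox]) -> List[TextBox]:
    """Sort by page, then top-to-bottom, then left-to-right (a crude heuristic)."""
    # PDF y grows upward, so a larger y1 means nearer the top of the page.
    return sorted(boxes, key=lambda b: (b.page, -b.y1, b.x0))
```

Something like `reading_order(iter_text_boxes("input.pdf"))` would then give positioned text to work with, where `input.pdf` is a placeholder path. For the HTML export mentioned above, the `pdf2txt.py` script shipped with PDFMiner should accept `-t html`, though it is worth checking against the installed version.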
Have you tried the pyPDF or ReportLab PDF libraries? I have not used them personally, but you could give them a try. here is useful too
Xpdf has a utility called pdftotext that does a great job. http://foolabs.com/xpdf/download.html
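If you want to drive that utility from a script, a minimal sketch could look like the following. It assumes a `pdftotext` binary is on your PATH and uses its `-layout` option, which tries to keep the physical layout of the page in the plain-text output. The helper names `pdftotext_command` and `run_pdftotext` are hypothetical.

```python
import shutil
import subprocess
from typing import List


def pdftotext_command(pdf_path: str, txt_path: str, keep_layout: bool = True) -> List[str]:
    """Build the argument list for pdftotext without running it."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the original physical layout.
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def run_pdftotext(pdf_path: str, txt_path: str, keep_layout: bool = True) -> None:
    """Run pdftotext, raising if the binary is missing or exits with an error."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext was not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
```

Note that this only gives you spacing-based layout in plain text; fonts and styles are lost.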
If you want to do it just like Google:
Google converts the PDF to an image, and then overlays the image, where the text used to be, with JavaScript highlightable areas (which is about like Voodoo magic). The areas appear to be text when you move your cursor over them, but they're not. Knowing this might not help you, but that's how they do it. If you want to reverse engineer it, you might start with https://www.mercurial-scm.org/. On the home page, they do the same thing with JavaScript to make the text highlightable and copyable. You can extract the text from the PDF and find its location on the page with one of the libraries mentioned in the other answers. Then you can overlay an extracted image of the file with the same style of JavaScript areas.
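To make the overlay idea concrete, here is a minimal sketch of one way to do it. This is not Google's actual implementation: it swaps the JavaScript areas for static, absolutely positioned HTML spans with transparent text, which stay selectable and copyable in a browser. The function name `overlay_html`, the `(x0, y0, x1, y1, text)` box format, and the image URL are all assumptions for illustration. It assumes box coordinates are in PDF points with the origin at the bottom-left, and that the page has already been rendered to an image by some other tool.

```python
from html import escape
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float, str]  # x0, y0, x1, y1, text


def overlay_html(image_url: str, page_width: float, page_height: float,
                 boxes: Iterable[Box]) -> str:
    """Return a div showing the page image with invisible, selectable text on top."""
    spans = []
    for x0, y0, x1, y1, text in boxes:
        # PDF measures y from the bottom edge, CSS from the top, so flip the axis.
        top = page_height - y1
        style = (
            f"position:absolute;left:{x0:g}pt;top:{top:g}pt;"
            f"width:{x1 - x0:g}pt;height:{y1 - y0:g}pt;color:transparent"
        )
        spans.append(f'<span style="{style}">{escape(text)}</span>')
    container = (
        f"position:relative;width:{page_width:g}pt;height:{page_height:g}pt;"
        f"background:url('{escape(image_url)}') 0 0/100% 100% no-repeat"
    )
    return f'<div style="{container}">' + "".join(spans) + "</div>"
```

For a US Letter page, `overlay_html("page1.png", 612, 792, boxes)` would produce one such div. Matching font size to the box height is left out, so selection highlights will only roughly line up with the rendered text.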
If you don't have your heart set on doing this with Python, Ghostscript can do it for you. Check out pdf2ascii (a script that comes with GS) to get the plain text. Styles are more complicated, as they can be specified in a few different ways.
Acrobat Professional can do the job. In the "File" menu, choose Export, then choose Text.