Ruby: reading PDF files

Posted 07-17 09:14

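I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OSX).

So far I have found the rather old and simple PDF-toolkit (a pdftotext wrapper) and PDF-reader, which could not read most of my files. Both libraries do offer the functionality I'm looking for, though.

My question: have I missed something? Is there a tool better suited (faster and more reliable) to my problem?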

素衣风尘叹 2024-07-24 09:14:26

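You may find Docsplit useful:

Docsplit is a command-line utility and Ruby library for splitting documents apart into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)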

半衬遮猫 2024-07-24 09:14:26

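After trying different approaches, I'm now using PDF-Toolkit. It is quite old, but fast, stable and reliable. Besides, it really doesn't need to be new, since it only wraps the xpdf command-line utilities.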
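
Since the wrapper only shells out to the xpdf tools, the same thing can be approximated without the gem. The following is a minimal sketch, not PDF-Toolkit's own code: it assumes a `pdftotext` binary (from xpdf or poppler-utils) is on the PATH, and the helper names `pdftotext_command` and `extract_text` are made up for illustration.

```ruby
require "open3"

# Build the argument list for pdftotext.
# Passing "-" as the output file asks pdftotext to write the text to stdout.
def pdftotext_command(pdf_path, first_page: nil, last_page: nil, layout: false)
  cmd = ["pdftotext", "-enc", "UTF-8"]
  cmd += ["-f", first_page.to_s] if first_page # first page to convert
  cmd += ["-l", last_page.to_s] if last_page   # last page to convert
  cmd << "-layout" if layout                   # keep the physical layout
  cmd + [pdf_path, "-"]
end

# Run pdftotext and return the extracted text as a String.
# Raises if the tool exits with a non-zero status.
def extract_text(pdf_path, **options)
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path, **options))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout
end
```

For a large file, something like `extract_text("large_file.pdf", first_page: 1, last_page: 10)` would convert only the first ten pages, which keeps memory use down when the whole document is not needed at once.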

酒浓于脸红 2024-07-24 09:14:26

You can use JRuby and a Java PDF library parser such as Apache PDFBox (https://www.ohloh.net/p/pdfbox). See also http://java-source.net/open-source/pdf-libraries.

还在原地等你 2024-07-24 09:14:26

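Did you have a look at the CombinePDF library?

It's a pure Ruby solution that allows some PDF manipulation, such as extracting pages, overlaying one PDF page on another, page numbering, writing basic text and tables, etc.

Here's an example of stamping an existing PDF file with a logo. The example reads a PDF file, extracts one page to use as a stamp, and then stamps another PDF file.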

require 'combine_pdf'
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf = CombinePDF.load "content_file.pdf"
pdf.pages.each {|page| page << company_logo}
pdf.save "content_with_logo.pdf"

You can also stamp text or number pages, for example:

require 'combine_pdf'

pdf = CombinePDF.load "content_file.pdf"

# Add page numbers; formatting and placement options are available.
pdf.number_pages

# One way to stamp: add a text box to each page.
pdf.pages.each {|page| page.textbox "One Way To Stamp"}

# Another way: a shortcut method that stamps every page.
pdf.stamp_pages "Another way to stamp"

# The shortcut method works for both text and PDF stamps.
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf.stamp_pages company_logo

# Write a simple table on the first page.
pdf.pages[0].write_table headers: ['first name', 'surname'], table_data: [['John', 'Doe'], ['Mr.', 'Smith']]

pdf.save "content_with_logo.pdf"

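It isn't meant for complex operations, but it complements most PDF authoring libraries and lets you work from PDF templates instead of writing everything from scratch.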

转身以后 2024-07-24 09:14:26

Here are some options:

http://en.wikipedia.org/wiki/List_of_PDF_software

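From that link, and from searching SourceForge, there are several command-line utilities that can do what you want, for example: http://pdftohtml.sourceforge.net/

Depending on your requirements and what your PDFs look like, you could consider the Google Docs API (upload the PDF, then download it as text), or try something like gocr. I've had good luck in the past parsing text out of images with gocr; you just shell out to it, something like gocr -i whatever.pdf (I think it works with PDFs).

The downside of all of these is that they are not pure-Ruby implementations, but many of the good (and free) OCR projects seem to be done this way.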
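
Because these tools live outside Ruby, it helps to check that the binary is actually installed before shelling out. Here is a rough sketch of that pattern; the helper names `find_executable` and `run_tool` are hypothetical, and nothing here is specific to gocr or pdftohtml beyond passing their arguments through unchanged.

```ruby
require "open3"

# Return the full path of an executable found on PATH, or nil if it is absent.
def find_executable(name, path: ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Run an external tool with the given arguments and return its stdout.
# Raises ArgumentError if the tool is missing, RuntimeError if it fails.
def run_tool(name, *args)
  exe = find_executable(name)
  raise ArgumentError, "#{name} is not installed or not on PATH" if exe.nil?

  # The [path, argv0] form bypasses the shell, so odd file names are safe.
  output, status = Open3.capture2([exe, name], *args)
  raise "#{name} exited with status #{status.exitstatus}" unless status.success?
  output
end
```

With gocr installed, `run_tool("gocr", "-i", "whatever.pdf")` would mirror the command mentioned in this answer; whether gocr accepts a PDF directly is as uncertain here as it is there.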