Ruby: Reading PDF files

Posted on 07-17 09:14

I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OSX).

So far I've found the rather old and simple PDF-toolkit (a pdftotext wrapper) and PDF-reader, which was unable to read most of my files, even though both libraries provide exactly the functionality I was looking for.

My question: Have I missed something? Is there a tool that is better suited (faster and more reliable) to solve my problem?

Comments (6)

素衣风尘叹 2024-07-24 09:14:26

You might find Docsplit useful:

Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
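
A minimal sketch of how the Ruby side of Docsplit might be used for plain-text extraction. The helper names `docsplit_text_path` and `docsplit_summary` are hypothetical, and the output file naming (`<basename>.txt` inside the `:output` directory) is an assumption worth checking against the gem's documentation for your version. Docsplit itself shells out to external tools, so those need to be installed as well.

```ruby
require 'fileutils'

# Assumption: Docsplit writes the full text to "<basename>.txt" inside the
# directory given as :output. Verify this for the gem version you use.
def docsplit_text_path(pdf_path, out_dir)
  File.join(out_dir, "#{File.basename(pdf_path, '.*')}.txt")
end

# Hypothetical helper: extracts plain text and the page count for one PDF.
def docsplit_summary(pdf_path, out_dir)
  require 'docsplit' # loaded lazily so this file parses without the gem
  FileUtils.mkdir_p(out_dir)
  # ocr: false skips OCR and relies on text embedded in the PDF.
  Docsplit.extract_text([pdf_path], ocr: false, output: out_dir)
  {
    pages: Docsplit.extract_length(pdf_path),
    text: File.read(docsplit_text_path(pdf_path, out_dir))
  }
end
```

Calling `docsplit_summary('report.pdf', 'tmp/text')` would then return a hash with the page count and the extracted text, provided the gem and its command-line dependencies are present.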

半衬遮猫 2024-07-24 09:14:26

After trying different methods, I'm using PDF-Toolkit now. It's quite old, but it's fast, stable and reliable. Besides, it doesn't really need to be new, because it just wraps the xpdf command-line utilities.
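
To illustrate what "wrapping the command-line utilities" amounts to, here is a rough sketch that calls `pdftotext` directly through Ruby's `Open3`. This is not PDF-Toolkit's API; `pdftotext_command` and `extract_text` are hypothetical helper names, and the sketch assumes a `pdftotext` binary (from xpdf or poppler) is on the `PATH`. The page-range options can help with large files, since each call only processes a slice of the document.

```ruby
require 'open3'

# Hypothetical helper (not part of PDF-Toolkit): builds the argument list.
# "-layout" keeps the physical layout; the trailing "-" sends text to stdout.
def pdftotext_command(path, first_page: nil, last_page: nil)
  cmd = ['pdftotext', '-layout']
  cmd += ['-f', first_page.to_s] if first_page
  cmd += ['-l', last_page.to_s] if last_page
  cmd + [path, '-']
end

# Runs pdftotext and returns the extracted text, raising if the tool fails.
def extract_text(path, **opts)
  out, err, status = Open3.capture3(*pdftotext_command(path, **opts))
  raise "pdftotext failed: #{err.strip}" unless status.success?
  out
end
```

For example, `extract_text('big.pdf', first_page: 1, last_page: 50)` would return the text of the first fifty pages. Passing the command as an array avoids shell quoting problems with unusual file names.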

酒浓于脸红 2024-07-24 09:14:26

You could use JRuby and a Java PDF parsing library such as Apache PDFBox (https://www.ohloh.net/p/pdfbox). See also http://java-source.net/open-source/pdf-libraries.

还在原地等你 2024-07-24 09:14:26

Did you have a look at the CombinePDF library?

It's a pure Ruby solution that allows some PDF manipulation, such as extracting pages, overlaying one PDF page over another, page numbering, writing basic text and tables, etc.

Here's an example of stamping an existing PDF file with a logo. The example reads a PDF file, extracts one page to use as a stamp, and stamps another PDF file.

require 'combine_pdf'
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf = CombinePDF.load "content_file.pdf"
pdf.pages.each {|page| page << company_logo}
pdf.save "content_with_logo.pdf"

You can also stamp text, number pages, or write simple tables:

require 'combine_pdf'

pdf = CombinePDF.load "content_file.pdf"

pdf.number_pages #adds page numbers. you can add formatting and placement options.

pdf.pages.each {|page| page.textbox "One Way To Stamp"}

# you can use a shortcut method to stamp pages
pdf.stamp_pages "Another way to stamp"

#you can use the shortcut method for both text and PDF stamps
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf.stamp_pages company_logo

# you can also write simple tables
pdf.pages[0].write_table headers: ['first name', 'surname'], table_data: [['John', 'Doe'], ['Mr.', 'Smith']]

pdf.save "content_with_logo.pdf"

It's not meant for complex operations, but it complements most PDF authoring libraries and allows you to use PDF templates instead of writing the whole thing from scratch.

转身以后 2024-07-24 09:14:26

Here are some options:

http://en.wikipedia.org/wiki/List_of_PDF_software

From that link, and from searching SourceForge, there are a couple of command-line utilities that might do what you want, like this one: http://pdftohtml.sourceforge.net/

Depending on your requirements and what the PDFs look like, you could use the Google Docs API (upload the PDF, then download it as text), or try something like gocr. I've had a lot of luck parsing image text with gocr in the past; you'd just have to shell out to run it, e.g. gocr -i whatever.pdf (I think it works with PDFs).

The downside to all of these is that they're not pure-Ruby implementations, but lots of the good (and free) OCR projects seem to be done that way.
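
As a rough sketch of the "shell out to gocr" idea: gocr is generally described as reading image formats such as PNM, so this example first rasterizes the PDF and then runs OCR page by page. Using `pdftoppm` (from poppler) for the rasterizing step is an assumption that is not part of the answer above, and all helper names here are hypothetical.

```ruby
require 'open3'
require 'tmpdir'

# Assumption: pdftoppm is installed; it writes <prefix>-<n>.pgm files.
def rasterize_command(pdf_path, prefix, dpi: 300)
  ['pdftoppm', '-r', dpi.to_s, '-gray', pdf_path, prefix]
end

# gocr prints the recognized text of one image to stdout.
def gocr_command(image_path)
  ['gocr', '-i', image_path]
end

# Runs a command and returns its stdout, raising if it exits non-zero.
def run!(cmd)
  out, err, status = Open3.capture3(*cmd)
  raise "#{cmd.first} failed: #{err.strip}" unless status.success?
  out
end

# Rasterizes every page into a temp dir, then OCRs the pages in order.
def ocr_pdf(pdf_path)
  Dir.mktmpdir('ocr') do |dir|
    run!(rasterize_command(pdf_path, File.join(dir, 'page')))
    pages = Dir[File.join(dir, 'page-*.pgm')]
            .sort_by { |f| f[/(\d+)\.pgm\z/, 1].to_i }
    pages.map { |img| run!(gocr_command(img)) }.join("\f")
  end
end
```

OCR is only worth the cost for scanned documents; for PDFs that already contain a text layer, a text extractor is typically much faster and more accurate.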

养猫人 2024-07-24 09:14:26

If you just need to get the text content out of a PDF file, pdftohtml at SourceForge is efficient.
It is not suited for dealing with images.
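
A small sketch of calling pdftohtml from Ruby and reducing its output to plain text. The helper names are hypothetical, the tag stripping is deliberately crude (it is not a general HTML parser), and the sketch assumes a `pdftohtml` binary on the `PATH` that accepts the `-i` (ignore images), `-noframes` and `-stdout` options.

```ruby
require 'open3'

# A few entities commonly seen in simple HTML output; extend as needed.
ENTITIES = {
  '&amp;' => '&', '&lt;' => '<', '&gt;' => '>',
  '&quot;' => '"', '&#160;' => ' '
}.freeze

def pdftohtml_command(path)
  ['pdftohtml', '-i', '-noframes', '-stdout', path]
end

# Turns <br> into newlines, drops all other tags, decodes known entities.
def html_to_text(html)
  text = html.gsub(%r{<br\s*/?>}i, "\n").gsub(/<[^>]+>/, '')
  text = text.gsub(/&(?:amp|lt|gt|quot|#160);/, ENTITIES)
  text.gsub(/[ \t]+\n/, "\n").strip
end

# Runs pdftohtml and returns the page content as plain text.
def pdf_text_via_html(path)
  html, err, status = Open3.capture3(*pdftohtml_command(path))
  raise "pdftohtml failed: #{err.strip}" unless status.success?
  html_to_text(html)
end
```

If the HTML structure matters (links, bold text, page breaks), a real HTML parser would be a better fit than the regular expressions used here.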
