How to extract text from a PDF?

Posted on 2024-09-18 00:58:44, word count 1809, views 9, comments 0

Comments (15)

初熏 2024-09-25 00:58:45

I know this topic is quite old, but the need is still there. I read many documents, forum posts, and scripts, and built a new, more advanced one that supports both compressed and uncompressed PDFs:

https://gist.github.com/smalot/6183152

In some cases, using the command line is forbidden for security reasons.
So a native PHP class can fit many needs.

Hope it helps everyone.

腹黑女流氓 2024-09-25 00:58:45

For image extraction, pdfimages is a free command line tool for Linux or Windows (win32):

pdfimages: Extract and Save Images From A Portable Document Format (PDF) File

唯憾梦倾城 2024-09-25 00:58:45

Apache PDFBox has this feature. The text extraction part is described in:

http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html

For an example implementation, see
https://github.com/WolfgangFahl/pdfindexer

The test case TestPdfIndexer.testExtracting shows how it works.

夏雨凉 2024-09-25 00:58:45

QuickPDF seems to be a reasonable library that should do what you want for a reasonable price.

http://www.quickpdflibrary.com/ - They have a 30-day trial.

灼痛 2024-09-25 00:58:45

On my Macintosh systems, I find that "Adobe Reader" does a reasonably good job. I created an alias on my Desktop that points to "Adobe Reader.app". All I do is drop a PDF file on the alias, which makes it the active document in Adobe Reader. Then, from the File menu, I choose "Save as Text...", give it a name and a location, click "Save", and I'm done.

不羁少年 2024-09-25 00:58:44

I was given a 400 page pdf file with a table of data that I had to import - luckily no images. Ghostscript worked for me:

gswin64c -sDEVICE=txtwrite -o output.txt input.pdf

The output file was split into pages with headers, etc., but it was then easy to write an app to strip out the blank lines and similar noise and read in all 30,000 records. -dSIMPLE and -dCOMPLEX made no difference in this case.
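
The cleanup step described above could look roughly like the following Python sketch. It assumes the txtwrite output has already been read into a string. The `header_prefixes` parameter and the function name are hypothetical, since the post does not show what the page headers looked like.

```python
def clean_extracted_text(text, header_prefixes=()):
    """Drop blank lines and lines that start with a known page-header prefix.

    `header_prefixes` is a hypothetical parameter: pass the strings that
    your own page headers begin with.
    """
    records = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank or whitespace-only lines
        if any(stripped.startswith(prefix) for prefix in header_prefixes):
            continue  # skip lines that look like a repeated page header
        records.append(stripped)
    return records
```

Each remaining line can then be split into fields according to however the table columns are separated in your particular file.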

尛丟丟 2024-09-25 00:58:44

An efficient command-line tool, open source and free of charge, available on both Linux and Windows, is simply named pdftotext. This tool is part of the xpdf package.

http://en.wikipedia.org/wiki/Pdftotext
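
If you want to call pdftotext from a script, a minimal Python sketch might look like this. It assumes the `pdftotext` binary is on your PATH. The function names are hypothetical, and the page-range and layout options are shown only as examples of the tool's `-f`, `-l`, and `-layout` flags.

```python
import subprocess


def build_pdftotext_command(pdf_path, first_page=None, last_page=None, layout=False):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]  # ask for UTF-8 output
    if layout:
        cmd.append("-layout")  # try to keep the physical page layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, "-"]  # "-" as the output file name means stdout
    return cmd


def pdf_to_text(pdf_path, **options):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, **options),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

For example, `pdf_to_text("input.pdf", first_page=3, last_page=7)` would return the text of pages 3 to 7, provided the file exists and the tool is installed.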

白馒头 2024-09-25 00:58:44

As of today I know: the best tool for text extraction from PDFs is TET, the Text Extraction Toolkit. TET is part of the PDFlib.com family of products.

PDFlib.com is Thomas Merz's company. In case you don't recognize his name: Thomas Merz is the author of the "PostScript and PDF Bible".

TET's first incarnation is a library. That one can probably do everything Budda006 wanted, including positional information about every element on the page. Oh, and it can also extract images. It recombines images which are fragmented into pieces.

PDFlib.com also offers another incarnation of this technology, the TET plugin for Acrobat. The third incarnation is the PDFlib TET iFilter. This is a standalone tool for user desktops. Both of these are free (as in beer) to use for private, non-commercial purposes.

And it's really powerful. Way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) only spit out garbage.

I just tested the desktop standalone tool, and what they say on their webpage is true. It has a very good command line. The tool handled some of my "problematic" PDF test files to my full satisfaction.

From now on, this will be my recommendation for every sophisticated and challenging PDF text extraction requirement.

TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and contents of each table cell separately. It deals very well with hyphenations: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters...

Give it a try.

薄荷港 2024-09-25 00:58:44

For Python, there are PDFMiner and PyPDF2. For more information on these, see "Python module for converting PDF to text".
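
As a rough illustration of the PyPDF2 route, here is a minimal sketch. It assumes a PyPDF2 release that provides `PdfReader` and `page.extract_text()`. The helper names are hypothetical, and the library is imported lazily so the first function can be used with any reader-like object.

```python
def extract_pages(reader):
    """Return the text of each page of a PyPDF2-style reader as a list.

    `reader` only needs a `.pages` sequence whose items provide
    `.extract_text()`; pages that yield no text become empty strings.
    """
    return [(page.extract_text() or "") for page in reader.pages]


def extract_pdf_pages(path):
    """Open `path` with PyPDF2 and return the text of every page."""
    from PyPDF2 import PdfReader  # third-party: pip install PyPDF2

    return extract_pages(PdfReader(path))
```

With pdfminer.six, the rough equivalent for a whole document is `pdfminer.high_level.extract_text(path)`. Which of the two gives better results tends to depend on the PDF in question.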

眉黛浅 2024-09-25 00:58:44

Here is my suggestion.
If you want to extract text from a PDF, you could import the PDF file into Google Docs, then export it to a friendlier format such as .html, .odf, .rtf, .txt, etc. All of this can be done using the Drive API. It is free* and robust. Take a look at:

https://developers.google.com/drive/v2/reference/files/insert
https://developers.google.com/drive/v2/reference/files/get

Because it is a REST API, it is compatible with all programming languages. The links I posted above have working examples for many languages, including Java, .NET, Python, PHP, Ruby, and others.

I hope it helps.

寄与心 2024-09-25 00:58:44

PdfTextStream (which you said you have been looking at) is now free for single-threaded applications. In my opinion its quality is much better than other libraries (especially for things like funky embedded fonts).

It is available for Java and C#.

Alternatively, you should have a look at Apache PDFBox, which is open source.

贪恋 2024-09-25 00:58:44

One of the comments here used gs on Windows. I had some success with that on Linux/OSX too, with the following syntax:

gs \
 -q \
 -dNODISPLAY \
 -dSAFER \
 -dDELAYBIND \
 -dWRITESYSTEMDICT \
 -dSIMPLE \
 -f ps2ascii.ps \
 "${input}" \
 -dQUIET \
 -c quit

I used -dSIMPLE instead of -dCOMPLEX because the latter outputs one character per line.

停滞 2024-09-25 00:58:44

The Docotic.Pdf library can be used to extract text from PDF files, either as plain text or as a collection of text chunks with coordinates for each chunk.

Docotic.Pdf can be used to extract images from PDFs, too.

Disclaimer: I work for Bit Miracle.

有深☉意 2024-09-25 00:58:44

Since the question is specifically about alternative tools for getting data from a PDF as XML, you may be interested in the commercial tool "ByteScout PDF Extractor SDK", which does exactly this: it extracts text from a PDF as XML, along with positioning data (x, y) and font information:

Text in the source PDF:

Products | Units | Price 

Output XML:

<row>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="212" y="126" width="47" height="11">Products</text>
  </column>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="428" y="126" width="27" height="11">Units</text>
  </column>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="503" y="126" width="26" height="11">Price</text>
  </column>
</row>

P.S.: It also breaks the text into a table-based structure.

Disclosure: I work for ByteScout
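
For readers who end up with XML shaped like the sample above, here is a small Python sketch of how the cells and their positions might be read back using only the standard library. It does not use the ByteScout SDK. The names `SAMPLE_XML` and `parse_row` are hypothetical, and the element layout is assumed to match the sample in this answer.

```python
import xml.etree.ElementTree as ET

# Sample copied from the answer above (one table row with three cells).
SAMPLE_XML = """
<row>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="212" y="126" width="47" height="11">Products</text>
  </column>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="428" y="126" width="27" height="11">Units</text>
  </column>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="503" y="126" width="26" height="11">Price</text>
  </column>
</row>
"""


def parse_row(xml_string):
    """Return a (text, x, y) tuple for every <text> element in a <row>."""
    row = ET.fromstring(xml_string.strip())
    cells = []
    for text_el in row.iter("text"):
        # Coordinates are parsed as floats in case they are not whole numbers.
        cells.append((text_el.text, float(text_el.get("x")), float(text_el.get("y"))))
    return cells
```

Sorting the resulting tuples by `y` and then `x` is one simple way to put the cells back into reading order.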

安穩 2024-09-25 00:58:44

The best thing I can currently think of (within the list of "simple" tools) is Ghostscript (the current version at the time of writing is v8.71) together with the PostScript utility program ps2ascii.ps. Ghostscript ships it in its lib subdirectory. Try this (on Windows):

gswin32c.exe ^
   -q ^
   -sFONTPATH=c:/windows/fonts ^
   -dNODISPLAY ^
   -dSAFER ^
   -dDELAYBIND ^
   -dWRITESYSTEMDICT ^
   -dCOMPLEX ^
   -f ps2ascii.ps ^
   -dFirstPage=3 ^
   -dLastPage=7 ^
   input.pdf ^
   -dQUIET ^
   -c quit

This command processes pages 3-7 of input.pdf. Read the comments in the ps2ascii.ps file itself to see what the "weird" numbers and additional information mean (they indicate strings, positions, widths, colors, pictures, rectangles, fonts, and page breaks). To get "simple" text output, replace -dCOMPLEX with -dSIMPLE.
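
If you need to run this for many files or page ranges, the command can be assembled in a script. The following Python sketch only rebuilds the flags shown above (minus -sFONTPATH). The function names are hypothetical, and it assumes that ps2ascii.ps is present in your Ghostscript installation and that the executable name you pass (`gs`, `gswin32c.exe`, ...) is on your PATH.

```python
import subprocess


def build_ps2ascii_command(pdf_path, first_page=None, last_page=None,
                           simple=True, gs="gs"):
    """Build the Ghostscript argument list for a ps2ascii.ps text dump."""
    cmd = [gs, "-q", "-dNODISPLAY", "-dSAFER", "-dDELAYBIND", "-dWRITESYSTEMDICT"]
    # -dSIMPLE gives plain text; -dCOMPLEX adds position and font details.
    cmd.append("-dSIMPLE" if simple else "-dCOMPLEX")
    cmd += ["-f", "ps2ascii.ps"]
    if first_page is not None:
        cmd.append(f"-dFirstPage={first_page}")
    if last_page is not None:
        cmd.append(f"-dLastPage={last_page}")
    cmd += [pdf_path, "-dQUIET", "-c", "quit"]
    return cmd


def run_ps2ascii(pdf_path, **options):
    """Run Ghostscript with the command above and return its raw stdout bytes."""
    result = subprocess.run(
        build_ps2ascii_command(pdf_path, **options),
        capture_output=True,
        check=True,  # raise CalledProcessError if Ghostscript fails
    )
    return result.stdout
```

For instance, `build_ps2ascii_command("input.pdf", 3, 7, simple=False, gs="gswin32c.exe")` reproduces the argument order of the Windows command in this answer.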
