Extracting text from a PDF using C#

Posted on 2024-08-19 07:33:36

Comments (6)

一袭水袖舞倾城 2024-08-26 07:33:36

Doing this reliably can be difficult. The problem is that PDF is a presentation format that prioritizes good typography. Suppose you just want to output a single word: Tap.

A PDF rendering engine might output this as two separate calls, as in this pseudocode:

moveto (x1, y); output ("T")
moveto (x2, y); output ("ap")

The engine might do this because the default kerning (inter-letter spacing) between the letters T and a is not acceptable to it, or because it is adding or removing a tiny amount of space between characters to produce a fully justified line. The end result is that the text fragments actually stored in a PDF are very often not full words, but pieces of them.
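To make the consequence concrete, here is a minimal sketch of how an extractor might stitch such fragments back into words. It is an illustration only: the `(X, Width, Text)` tuple, the `MergeFragments` helper, and the `gapThreshold` value are hypothetical and do not come from any particular PDF library. Real extractors also have to deal with multiple lines, rotation, and font metrics.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical fragment shape: (X, Width, Text) for pieces on one line.
// Real libraries expose positioned text differently.
static string MergeFragments(
    IEnumerable<(double X, double Width, string Text)> fragments,
    double gapThreshold)
{
    var sb = new StringBuilder();
    double? previousEnd = null;

    // Sort left to right so fragments are joined in reading order.
    foreach (var fragment in fragments.OrderBy(item => item.X))
    {
        // A gap wider than the threshold is treated as a word break;
        // small or negative gaps (kerning) are joined without a space.
        if (previousEnd.HasValue && fragment.X - previousEnd.Value > gapThreshold)
            sb.Append(' ');

        sb.Append(fragment.Text);
        previousEnd = fragment.X + fragment.Width;
    }

    return sb.ToString();
}

// "T" and "ap" are positioned separately but sit close together,
// while "water" starts after a visibly larger gap.
var line = new[]
{
    (X: 10.0, Width: 6.0, Text: "T"),
    (X: 15.5, Width: 11.0, Text: "ap"),
    (X: 31.0, Width: 28.0, Text: "water"),
};

Console.WriteLine(MergeFragments(line, gapThreshold: 2.0)); // prints "Tap water"
```

The threshold is the fragile part: too small and words get split at kerning adjustments, too large and neighbouring words run together. That is why the libraries suggested in the other answers rely on heuristics rather than a fixed rule.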

故事与诗 2024-08-26 07:33:36

Take a look at Tika on DotNet, available through NuGet:
https://www.nuget.org/packages/TikaOnDotnet.TextExtractor/

This is a wrapper around the extremely good Tika Java library, built using IKVM. It is very easy to use and handles a wide variety of file types besides PDF, including old and new Office formats. It auto-selects the parser based on the file extension, so it's as easy as:

var text = new TextExtractor().Extract(file.FullName).Text;

Update: one caveat with this solution is that development of IKVM has ended. I'm not sure what this will mean in the long run. http://weblog.ikvm.net/2017/04/21/TheEndOfIKVMNET.aspx

月朦胧 2024-08-26 07:33:36

If you are processing PDF files in order to import data into a database, I suggest considering ByteScout PDF Extractor SDK. Some useful features include:

  • table detection;
  • text extraction as CSV, XML or formatted text (with the optional layout restoration);
  • text search with support for regular expressions;
  • low-level API to access text objects

DISCLAIMER: I'm affiliated with ByteScout

鸠书 2024-08-26 07:33:36

You can try Toxy, a text/data extraction framework for .NET. It supports .NET Standard 2.0. For details, see https://github.com/nissl-lab/toxy

橘和柠 2024-08-26 07:33:36

You can try the Docotic.Pdf library (disclaimer: I work for Bit Miracle) to extract text from PDF files. The library uses some heuristics to extract nice-looking text without unwanted spaces between the letters of a word.

Please take a look at a sample that shows how to extract text from a PDF.

|煩躁 2024-08-26 07:33:36

If you're looking for a "free" alternative, check out PDF Clown. I have personally used an iFilter-based approach, and it seems to work fine if you need to support other file types easily. Sample code here.
