There may be some difficulty in doing this reliably. The problem is that PDF is a presentation format which attaches importance to good typography. Suppose you just wanted to output a single word: Tap.
A PDF rendering engine might output this as two separate calls: one that moves to a position and draws "T", and a second that moves to a slightly adjusted position and draws "ap".
This would be done because the default kerning (inter-letter spacing) between the letters T and a might not be acceptable to the rendering engine, or because it is adding or removing tiny amounts of space between characters to get a fully justified line. The end result is that the actual text fragments found in a PDF are very often not full words, but pieces of them.
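To make the fragmentation concrete, here is a minimal sketch of how an extractor might stitch fragments back into words by looking at the horizontal gap between them. It is only an illustration under simplifying assumptions: `Fragment`, `merge_fragments` and `gap_threshold` are hypothetical names, the coordinates are made up, and real libraries also have to account for font size, baselines, rotation and multi-column layouts.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One piece of text as it might be emitted by a single draw call."""
    x: float      # left edge of the fragment on the page (hypothetical units)
    width: float  # rendered width of the fragment
    text: str


def merge_fragments(fragments, gap_threshold=1.0):
    """Join fragments from one line into words.

    A gap wider than gap_threshold between the end of one fragment and the
    start of the next is treated as a word break; anything smaller
    (including a negative gap caused by kerning) is treated as the same word.
    """
    words = []
    current = ""
    prev_end = None
    for frag in sorted(fragments, key=lambda f: f.x):
        if prev_end is not None and frag.x - prev_end > gap_threshold:
            words.append(current)
            current = ""
        current += frag.text
        prev_end = frag.x + frag.width
    if current:
        words.append(current)
    return words


# "Tap water" drawn as four fragments, the way a typesetting engine might emit it.
line = [
    Fragment(x=100.0, width=6.0, text="T"),
    Fragment(x=105.4, width=11.0, text="ap"),   # kerned: starts before "T" ends
    Fragment(x=120.0, width=14.0, text="wat"),
    Fragment(x=134.2, width=8.0, text="er"),
]

print(merge_fragments(line))
```

The threshold is the fragile part: set it too low and words split apart, too high and neighbouring words run together, which is why extraction quality varies so much between libraries.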
Take a look at Tika on DotNet, available through NuGet:
https://www.nuget.org/packages/TikaOnDotnet.TextExtractor/
This is a wrapper around the extremely good Apache Tika Java library, using IKVM. It is very easy to use and handles a wide variety of file types besides PDF, including old and new Office formats. It auto-selects the parser based on the file extension, so extracting text takes very little code.
Update: One caution with this solution is that development on IKVM has ended. I'm not sure what this will mean in the long run. http://weblog.ikvm.net/2017/04/21/TheEndOfIKVMNET.aspx
If you are processing PDF files in order to import data into a database, then I suggest considering ByteScout PDF Extractor SDK. Some useful functions included are
DISCLAIMER: I'm affiliated with ByteScout
You can try Toxy, a text/data extraction framework for .NET. It supports .NET Standard 2.0. For details, see https://github.com/nissl-lab/toxy
You can try the Docotic.Pdf library (disclaimer: I work for Bit Miracle) to extract text from PDF files. The library uses heuristics to extract clean-looking text without unwanted spaces between the letters of a word.
Please take a look at a sample that shows how to extract text from PDF.
If you're looking for a "free" alternative, check out PDF Clown. I have personally used an IFilter-based approach, and it seems to work fine if you also need to support other file types easily. Sample code here.