How to handle Unicode character encoding issues when converting a document from PDF to text
I am trying to extract text from a PDF. The PDF contains text in Hindi (Unicode). The extraction utility I am using is Apache PDFBox (http://pdfbox.apache.org/). The extractor does extract text, but the result is not recognizable. I have tried switching between many encodings and fonts, but I still do not get the expected text.
Here is an example:
Say the text in the PDF is: पवार
What it looks like after extraction is: ̄Ö3⁄4ÖÖ ̧ü
Are there any suggestions?
PDF is, at its heart, a print format, and thus records text as a series of visual glyphs, not as actual text. It was never originally intended as a digital archive format, and that still shows in many documents. With complex scripts, such as Arabic or Indic scripts that require glyph substitution, ligation and reordering, you often get a mess. What you usually get are the glyph IDs used in the embedded fonts, which bear no resemblance to Unicode or any actual text encoding (fonts represent glyphs; some of them may be mapped to Unicode code points, but others exist only for font-internal use, such as context-dependent glyph variants or ligatures). You can see the same with PDFs produced by LaTeX, especially with non-ASCII characters and math.
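To make the glyph-ID point concrete, here is a minimal Java sketch using only the standard library. The four byte codes and the lookup table are entirely hypothetical (they are not taken from any real font or from the PDF in the question); the table merely stands in for the kind of code-to-Unicode mapping a PDF may or may not carry. It shows why reinterpreting the same codes under a different character encoding cannot recover the text when that mapping is missing.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class GlyphCodeDemo {
    // HYPOTHETICAL mapping from font-internal glyph codes to Unicode.
    // A real PDF would need to ship such a mapping for extraction to work.
    static final Map<Integer, String> HYPOTHETICAL_TO_UNICODE = new HashMap<>();
    static {
        HYPOTHETICAL_TO_UNICODE.put(0x41, "\u092A"); // DEVANAGARI LETTER PA
        HYPOTHETICAL_TO_UNICODE.put(0x42, "\u0935"); // DEVANAGARI LETTER VA
        HYPOTHETICAL_TO_UNICODE.put(0x43, "\u093E"); // DEVANAGARI VOWEL SIGN AA
        HYPOTHETICAL_TO_UNICODE.put(0x44, "\u0930"); // DEVANAGARI LETTER RA
    }

    // Treats the glyph codes as if they were Latin-1 text: yields unrelated characters.
    static String decodeNaively(byte[] codes) {
        return new String(codes, StandardCharsets.ISO_8859_1);
    }

    // Looks each code up in the mapping; unknown codes become U+FFFD.
    static String decodeWithMap(byte[] codes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // bytes are signed in Java
            sb.append(HYPOTHETICAL_TO_UNICODE.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] codes = {0x41, 0x42, 0x43, 0x44}; // made-up glyph codes
        System.out.println(decodeNaively(codes));
        System.out.println(decodeWithMap(codes));
    }
}
```

No choice of charset in `decodeNaively` would produce the Hindi word, because the codes were never an encoding of it in the first place; only the font-specific table relates them to Unicode.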
PDF also has facilities to embed the text as text alongside the visual representation, but that is solely at the discretion of the generating application. I have heard that Word tries very hard to retain that information when producing PDFs, but many PDF generators do not (it usually works somewhat for Latin text, which is probably why nearly no one bothers).
I think your best bet, if the PDF doesn't have the plain text available, is to run OCR on the PDF as an image.
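One way to decide automatically whether a given PDF needs the OCR route is to check how much of the extracted text actually falls in the expected script. The following is only a rough sketch using the Java standard library; the class name, the method names and the 0.5 threshold are illustrative assumptions, and the input would be whatever string your extractor returned.

```java
class ScriptCheck {
    // Fraction of non-whitespace code points whose Unicode script is Devanagari.
    static double devanagariRatio(String text) {
        int total = 0;
        int devanagari = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            total++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.DEVANAGARI) {
                devanagari++;
            }
        }
        return total == 0 ? 0.0 : (double) devanagari / total;
    }

    // Heuristic: if too little of the text is Devanagari, assume extraction failed.
    // The threshold is an arbitrary assumption and should be tuned per document set.
    static boolean needsOcr(String extracted, double threshold) {
        return devanagariRatio(extracted) < threshold;
    }

    public static void main(String[] args) {
        String good = "\u092A\u0935\u093E\u0930";                  // the expected Hindi word
        String bad = "\u00AF\u00D6\u00BE\u00D6\u00D6\u00B8\u00FC"; // Latin-range junk, similar in kind to the sample above
        System.out.println(needsOcr(good, 0.5));
        System.out.println(needsOcr(bad, 0.5));
    }
}
```

Documents that mix Hindi with a lot of Latin text, digits or punctuation would lower the ratio, so the threshold needs to be chosen with the actual material in mind.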