There are at least four different ways to get text into a PDF document (in order of likelihood):
1. Place the text with standard text operators and standard fonts
2. Place the text with standard text operators and non-standard fonts
3. Draw one or more images that represent the text
4. Place the text by manually drawing the glyphs with various PDF graphics commands
Case 1 is typically searchable. Case 2 is searchable if the font and encoding are sane; if they're not (and this is likely the case for non-Latin fonts), then there is probably no reliable way to map the encoded glyphs back to Unicode (and by the way, PDF is fairly Unicode-hostile). Case 3 is totally unsearchable without knowing more about how the PDF was generated. Case 4 is totally unsearchable.
That said, all four cases can be read with an OCR engine that understands Arabic. I understand that the Iris engine does Arabic.
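To make the case 2 point concrete: a text string in a PDF stores font-specific character codes, not Unicode, and a viewer can only search it if something (typically the font's ToUnicode CMap) maps those codes back. Here is a rough sketch of that lookup in Python. The CMap fragment and the codes are made up for illustration, and `parse_bfchar` and `decode_codes` are hypothetical helper names, not part of any library. It only handles `bfchar` entries and assumes fixed two-byte codes.

```python
import re

# Hand-written fragment in the style of a ToUnicode CMap (illustrative only).
# Each entry maps a font-specific character code to a UTF-16BE value.
TOUNICODE = """
2 beginbfchar
<0001> <0633>
<0002> <0644>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Collect code -> Unicode string pairs from the bfchar sections."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode_codes(hex_string, mapping, code_bytes=2):
    """Translate a hex string of character codes into Unicode text.

    Codes with no mapping become U+FFFD, which is what makes the
    text unsearchable in practice.
    """
    raw = bytes.fromhex(hex_string)
    chars = []
    for i in range(0, len(raw), code_bytes):
        code = int.from_bytes(raw[i:i + code_bytes], "big")
        chars.append(mapping.get(code, "\ufffd"))
    return "".join(chars)

mapping = parse_bfchar(TOUNICODE)
with_map = decode_codes("00010002", mapping)   # two Arabic letters
without_map = decode_codes("00010002", {})     # no table: nothing recoverable
```

With the table, the codes come back as real Arabic letters. With a missing or wrong table, the page still renders correctly (the glyphs are in the font), but a search has nothing to match against.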
It might not actually be text, or it might be in a container that Reader doesn't pay attention to. It's especially common to expand text objects into vector shapes when you're dealing with fonts that most people aren't going to have installed on their system. It looks the same on the screen, but it's not searchable.
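If you can get at a page's decoded content stream, a quick way to guess which situation you're in is to look at which operators it uses. The sketch below is a crude heuristic, not a real PDF parser: `classify_stream` is a hypothetical helper, it expects the stream to be already decompressed, and its string stripping ignores nested parentheses and inline images. `Tj`, `TJ`, `'` and `"` are the text-showing operators, `Do` paints an XObject (often an image), and `m`, `l`, `c`, `v`, `y`, `h`, `re` build vector paths.

```python
import re

TEXT_SHOW_OPS = {"Tj", "TJ", "'", '"'}
PATH_OPS = {"m", "l", "c", "v", "y", "h", "re"}

def classify_stream(content):
    """Guess how a decoded content stream puts marks on the page."""
    # Drop literal (...) and hex <...> strings so their contents
    # are not mistaken for operators.
    stripped = re.sub(r"\((?:\\.|[^\\()])*\)", " ", content)
    stripped = re.sub(r"<[0-9A-Fa-f\s]*>", " ", stripped)
    # Split on whitespace and array brackets; discard empty tokens.
    tokens = [t for t in re.split(r"[\s\[\]]+", stripped) if t]
    if any(t in TEXT_SHOW_OPS for t in tokens):
        return "text"
    if "Do" in tokens:
        return "xobject"
    if any(t in PATH_OPS for t in tokens):
        return "paths"
    return "unknown"

as_text = classify_stream("BT /F1 12 Tf (Hello) Tj ET")
as_image = classify_stream("q 100 0 0 50 0 0 cm /Im1 Do Q")
as_paths = classify_stream("10 10 m 20 20 l 30 10 l h f")
```

A "text" result only means text operators are present; whether the text is actually searchable still depends on the font and encoding. An "xobject" result may also be a form XObject that contains text of its own, so treat the answer as a hint about where to look next.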