How to embed external OCR into an existing PDF?
I have a set of images over which I run an OCR application. This process results in an XML file with character offsets. Then I convert the images to PDF using Acrobat 9. Now I would like to add the information from the XML file to the PDF as an invisible text layer, in order to get a searchable PDF. Is there an easy and free way to do this?
Some details:
I don't want to use Acrobat's OCR functionality;
The OCR process results in an XML file which contains elements like:
<line baseline="1049" l="158" t="1012" r="1196" b="1060">This is a sample line of text from an image</line>
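As a starting point for building the text layer, here is a minimal Python sketch that reads one such `<line>` element and converts its box to PDF coordinates. It rests on assumptions the question does not confirm: that `l`, `t`, `r`, `b`, and `baseline` are pixel values measured from the top-left corner of the image, and that the scan resolution is known. The function name `line_to_pdf_box` and the parameters `page_height_px` and `dpi` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample element copied from the question.
SAMPLE = ('<line baseline="1049" l="158" t="1012" r="1196" b="1060">'
          'This is a sample line of text from an image</line>')


def line_to_pdf_box(line_xml, page_height_px, dpi=300):
    """Return the text of one OCR <line> and its box in PDF points.

    Assumption: coordinates are pixels with the origin at the top-left
    of the image. PDF user space has its origin at the bottom-left and
    uses 72 points per inch, so y values are flipped and scaled.
    """
    elem = ET.fromstring(line_xml)
    scale = 72.0 / dpi

    left = int(elem.get("l"))
    right = int(elem.get("r"))
    top = int(elem.get("t"))
    bottom = int(elem.get("b"))
    baseline = int(elem.get("baseline"))

    return {
        "text": elem.text,
        "x": left * scale,                         # left edge of the line
        "y": (page_height_px - baseline) * scale,  # baseline, measured from page bottom
        "width": (right - left) * scale,
        "height": (bottom - top) * scale,
    }


if __name__ == "__main__":
    # 3300 px is an assumed page height (11 inches at 300 dpi).
    print(line_to_pdf_box(SAMPLE, page_height_px=3300))
```

Each resulting box could then be handed to a PDF library that can draw text in the invisible rendering mode (text rendering mode 3 in the PDF specification), scaled horizontally so that the string spans `width`.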
Update: it may be possible to do what I want in a different way. Suppose there is already a PDF file that was generated from a set of images and already contains OCRed text. Would it be possible to (maybe programmatically) access just the image of each page, process it (e.g., convert it to monochrome), and save it back to the PDF file? If so, the OCRed text would not be lost.
[Should I put this update into a separate question?]
Comments (2)
For your follow-up question about processing PDF files without losing the hidden layers: I believe Ghostscript is able to do this. For example, the following command should convert a PDF to grayscale:
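The command itself is not shown above. As a hedged sketch, a commonly used Ghostscript grayscale invocation can be wrapped in Python as follows. The flags are standard options of the `pdfwrite` device, but this is not necessarily the exact command the answer had in mind, and the helper names `build_gs_gray_command` and `convert_to_gray` are hypothetical.

```python
import shutil
import subprocess


def build_gs_gray_command(src, dst):
    """Build a Ghostscript command that rewrites src as a grayscale PDF."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",               # write a PDF rather than rasterizing
        "-sColorConversionStrategy=Gray",  # convert colors to gray
        "-dProcessColorModel=/DeviceGray",
        "-dCompatibilityLevel=1.4",
        "-dNOPAUSE",                       # do not pause between pages
        "-dBATCH",                         # exit when the input is processed
        f"-sOutputFile={dst}",
        src,
    ]


def convert_to_gray(src, dst):
    """Run the conversion, assuming the gs executable is on PATH."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) was not found on PATH")
    subprocess.run(build_gs_gray_command(src, dst), check=True)


if __name__ == "__main__":
    # Show the command without running it; file names are placeholders.
    print(" ".join(build_gs_gray_command("input.pdf", "grayscale.pdf")))
```

Because `pdfwrite` re-emits the page content instead of rendering it to pixels, the hidden text layer would be expected to survive, though that is worth confirming by searching the output file.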
If all you want to do is convert an existing PDF to grayscale, try ImageMagick:
I don't think this will change any other attributes in your PDF.
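The ImageMagick command is likewise not shown above. A minimal Python sketch of one plausible invocation follows; it is an assumption, not necessarily what the answer intended. It uses the ImageMagick 6 style `convert` executable (ImageMagick 7 installs it as `magick`), and the helper names and the 300 dpi default are hypothetical.

```python
import shutil
import subprocess


def build_convert_gray_command(src, dst, density=300):
    """Build an ImageMagick command that writes src as a grayscale PDF."""
    return [
        "convert",
        "-density", str(density),  # resolution used when reading the PDF pages
        src,
        "-colorspace", "Gray",     # convert the pixel data to grayscale
        dst,
    ]


def convert_to_gray(src, dst, density=300):
    """Run the conversion, assuming the convert executable is on PATH."""
    if shutil.which("convert") is None:
        raise RuntimeError("ImageMagick (convert) was not found on PATH")
    subprocess.run(build_convert_gray_command(src, dst, density), check=True)


if __name__ == "__main__":
    # Show the command without running it; file names are placeholders.
    print(" ".join(build_convert_gray_command("input.pdf", "gray.pdf")))
```

ImageMagick normally reads PDFs by rasterizing each page through Ghostscript, so the output is likely to contain images only. It is worth checking on a copy whether the hidden text layer is still searchable afterwards.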