We don’t allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Comments (3)
Tesseract is a very good OCR engine: https://github.com/tesseract-ocr/tesseract
The project was started at HP Labs and is now continued and sponsored by Google (for Google Books!). It is released under the Apache license, and it runs on Linux. It takes TIFF or PNG files as input; for PDFs, you will need to convert to one of these formats first. I don't think there is a language binding, so you would have to invoke it as a subprogram...
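A minimal sketch of the "invoke it as a subprogram" approach, in Java. It assumes a `tesseract` executable is on the `PATH` and relies on the basic command form `tesseract <image> <outputbase>`, which writes the recognized text to `<outputbase>.txt`. The class name `TesseractCli`, its method names, and the `ocr-output` file name are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class TesseractCli {

    // Builds the command line: tesseract <image> <outputBase>.
    // Tesseract appends ".txt" to the output base on its own.
    static List<String> buildCommand(Path image, String outputBase) {
        return List.of("tesseract", image.toString(), outputBase);
    }

    // Runs Tesseract as a child process and returns the recognized text.
    // Assumption: the "tesseract" binary is installed and on the PATH.
    static String recognize(Path image, Path outputBase)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase.toString()))
                .inheritIO() // forward Tesseract's own messages to our console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with status " + exitCode);
        }
        return Files.readString(Path.of(outputBase + ".txt"), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TesseractCli <image.png|image.tif>");
            return;
        }
        System.out.println(recognize(Path.of(args[0]), Path.of("ocr-output")));
    }
}
```

Running the process through `ProcessBuilder` with an argument list (rather than a single shell string) avoids quoting problems when file names contain spaces.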
Cuneiform is free and does a decent job. You could invoke it as a subprogram, but there's no language binding that I know of. It won't read PDFs directly, but you can easily take apart PDFs that are sequences of scanned images and feed the images to Cuneiform. There are also scripts to reassemble the images and text into a searchable PDF.
Try tesjeract, which uses JNI to call the Tesseract OCR API.
For PDFs, you'll need to convert them to images first, using Ghostscript, for instance.
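The PDF-to-image step could look roughly like the following sketch, which shells out to Ghostscript from Java. It assumes the Ghostscript executable is available as `gs` (on Windows it is typically named `gswin64c` instead) and uses the `png16m` output device at a caller-chosen resolution. The class name `PdfToPng`, its method names, and the `page-%03d.png` pattern are illustrative only.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

public class PdfToPng {

    // Builds a Ghostscript command that renders every page of the PDF
    // to a 24-bit PNG. A "%03d" in the output pattern is replaced by
    // the page number, giving page-001.png, page-002.png, and so on.
    static List<String> buildCommand(Path pdf, String outputPattern, int dpi) {
        return List.of(
                "gs",                 // assumption: Ghostscript binary name on PATH
                "-dBATCH",            // exit after the last file
                "-dNOPAUSE",          // do not prompt between pages
                "-dSAFER",            // restrict file access from the PDF
                "-sDEVICE=png16m",    // 24-bit colour PNG output
                "-r" + dpi,           // rendering resolution in DPI
                "-sOutputFile=" + outputPattern,
                pdf.toString());
    }

    // Runs the conversion and fails loudly on a non-zero exit status.
    static void convert(Path pdf, String outputPattern, int dpi)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, outputPattern, dpi))
                .inheritIO()
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("gs exited with status " + exitCode);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PdfToPng <input.pdf>");
            return;
        }
        convert(Path.of(args[0]), "page-%03d.png", 300);
    }
}
```

The resulting PNG files can then be passed one by one to the OCR engine. 300 DPI is a common choice for scanned text, but treat it as a starting point to tune rather than a requirement.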