Java OCR library: compiling tesseract on 64-bit Windows
I'm using ImageJ for processing document images (business documents) and I am looking for a good OCR library to retrieve text from some regions.
Currently I am using Asprise, but the results aren't very reliable. Certain characters are often confused (0 becomes O, 8 -> B, % -> 0, ...), and whitespace appears where it shouldn't, which makes postprocessing the data a lot of trouble. The images have a resolution of 1240x1754. I haven't tried a higher resolution yet, but the smallest characters I'd like to detect are 15 pixels high, so I assume the image quality is sufficient. (By the way, I am performing the OCR on the original image, not on a binarized one.)
Looking at similar questions here, I noticed tesseract has often been recommended. Since it's written in C++, I am not sure how I can use it from Java and ImageJ.
With Asprise, which, as far as I know, is also written in C++ and just offers a Java wrapper, I can perform OCR on a BufferedImage. So I assume I could do the same with tesseract.
1. How can I call tesseract functions from Java?
UPDATE: I tried to use tesjeract, but when I run my application it crashes with an
UnsatisfiedLinkError: C:\Windows\System32\tessdll.dll: Can't find dependent libraries
I was able to compile tesjeract and tesseract 2.04 successfully and placed tessdll.dll and tesjeract.dll in c:\windows\system32. I am using this static block to load the libraries:
    static
    {
        System.loadLibrary("tessdll");
        System.loadLibrary("tesjeract");
    }
In case it is relevant, I am using Windows 7 64-bit.
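The "Can't find dependent libraries" message generally means that the named DLL itself was located, but something it depends on could not be resolved. A minimal sketch for narrowing this down is to load each library separately and print the failure together with `java.library.path`. The class name `NativeLoadCheck` and the helper `tryLoad` are hypothetical names used only for illustration; `System.loadLibrary` and `UnsatisfiedLinkError` are standard Java.

```java
public class NativeLoadCheck {

    /**
     * Tries to load a native library by name.
     * Returns null on success, or a diagnostic message on failure.
     * (Hypothetical helper, not part of tesjeract.)
     */
    static String tryLoad(String name) {
        try {
            System.loadLibrary(name);
            return null;
        } catch (UnsatisfiedLinkError e) {
            // Report which library failed and where the JVM searched for it
            return "Could not load '" + name + "': " + e.getMessage()
                    + " (java.library.path=" + System.getProperty("java.library.path") + ")";
        }
    }

    public static void main(String[] args) {
        // Library names taken from the static block above
        for (String lib : new String[] {"tessdll", "tesjeract"}) {
            String problem = tryLoad(lib);
            System.out.println(problem == null ? "Loaded " + lib : problem);
        }
    }
}
```

Loading the libraries one at a time shows which of them actually fails, which a single static block does not reveal.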
2. So how can I convert a BufferedImage into a format tesseract is able to work with?
SOLVED
This is the code, if anyone is interested (it originates from audiveris):
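    import java.awt.image.BufferedImage;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import javax.imageio.ImageIO;
    import javax.imageio.ImageWriter;
    import javax.imageio.stream.ImageOutputStream;

    // Encodes the image as TIFF in memory and copies the bytes into a direct buffer.
    private ByteBuffer imageToTiffBuffer(BufferedImage image) throws IOException
    {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ImageOutputStream ios = ImageIO.createImageOutputStream(baos);
        // Take the first suitable TIFF writer; next() throws NoSuchElementException if none is installed
        ImageWriter writer = ImageIO.getImageWritersByFormatName("tiff").next();
        writer.setOutput(ios);
        writer.write(image);
        ios.close();
        writer.dispose(); // release the writer's resources
        // allocate() doesn't work here, so use a direct buffer
        ByteBuffer buf = ByteBuffer.allocateDirect(baos.size());
        buf.put(baos.toByteArray());
        return buf;
    }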
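A likely reason why allocate() doesn't work: if the native wrapper reads the buffer through JNI's direct-buffer access (`GetDirectBufferAddress`), it gets no usable address for a heap buffer. That the wrapper works this way is an assumption, not something confirmed here. The sketch below only illustrates the difference between the two buffer kinds; `DirectBufferDemo` and `toDirectBuffer` are hypothetical names.

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {

    /** Copies the bytes into a direct (off-heap) buffer and rewinds it for reading. */
    static ByteBuffer toDirectBuffer(byte[] data) {
        ByteBuffer buf = ByteBuffer.allocateDirect(data.length);
        buf.put(data);
        // flip() sets position to 0 and limit to the number of bytes written,
        // so Java-side readers see the data; native direct access ignores position
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        byte[] sample = {1, 2, 3};
        ByteBuffer heap = ByteBuffer.allocate(sample.length);
        ByteBuffer direct = toDirectBuffer(sample);
        System.out.println("heap buffer is direct:   " + heap.isDirect());
        System.out.println("direct buffer is direct: " + direct.isDirect());
        System.out.println("readable bytes: " + direct.remaining());
    }
}
```

The method above does not call flip(); that makes no difference to native code that addresses the memory directly, but it matters if the buffer is later read from Java.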
Comments (3)
The bitness of the external libraries needs to match that of your JVM. Since tesjeract is the lowest common denominator, you'll need to use the 32-bit JVM.
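To see which kind of JVM is actually running, a small check like the following can help. It is only a sketch: `sun.arch.data.model` is a HotSpot-specific property that may be missing on other JVMs, and the `os.arch` fallback is a heuristic. The names `JvmBitness` and `bitnessFrom` are made up for this example.

```java
public class JvmBitness {

    /**
     * Derives the JVM bitness from the two property values.
     * Returns 32 or 64, or -1 when nothing can be concluded.
     */
    static int bitnessFrom(String dataModel, String arch) {
        if ("32".equals(dataModel)) return 32;
        if ("64".equals(dataModel)) return 64;
        // Fallback heuristic: names such as "amd64" or "x86_64" contain "64"
        if (arch == null || arch.isEmpty()) return -1;
        return arch.contains("64") ? 64 : 32;
    }

    public static void main(String[] args) {
        // May be null on non-HotSpot JVMs
        String dataModel = System.getProperty("sun.arch.data.model");
        // Standard property describing the architecture
        String arch = System.getProperty("os.arch");
        System.out.println("JVM bitness: " + bitnessFrom(dataModel, arch));
    }
}
```

If this reports 64 while the DLLs were built as 32-bit, the libraries cannot be loaded into that JVM regardless of where they are placed.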
You could look at audiveris, a Java OMR package that I believe uses Tesseract for the text portions of sheets.
There are two Java wrappers for Tesseract 2.04 that you may want to take a look at: Tess4J and Tesjeract.