Java OCR library: compiling Tesseract on 64-bit Windows

Posted 2024-12-07 17:14:59

I'm using ImageJ for processing document images (business documents) and I am looking for a good OCR library to retrieve text from some regions.

Currently I am using Asprise, but the results aren't very reliable. Certain characters often get confused (0 becomes O, 8 becomes B, % becomes 0, ...), and whitespace appears where it isn't supposed to be, which makes postprocessing this data a lot of trouble. The images have a resolution of 1240x1754. I haven't tried a higher resolution yet, but the smallest characters I'd like to detect are 15 pixels high, so I assume the image quality is sufficient. (By the way, I am performing the OCR on the original image, not the binarized image.)

Looking at similar questions here, I noticed that Tesseract is often recommended. Since it's written in C++, I am not sure how I can use it from Java and ImageJ.

With Asprise, which, as far as I know, is also written in C++ and just offers a Java wrapper, I can perform OCR on a BufferedImage. So I assume I could do the same with Tesseract.

1. How can I call tesseract functions from Java?

UPDATE: I tried to use tesjeract, but when I execute my application it crashes with:

UnsatisfiedLinkError: C:\Windows\System32\tessdll.dll: Can't find dependent libraries

I was able to successfully compile tesjeract and Tesseract 2.04, and placed tessdll.dll and tesjeract.dll in c:\windows\system32. I am using this static block to load the libraries:

static
{
    System.loadLibrary("tessdll");
    System.loadLibrary("tesjeract");
}

In case it is relevant, I am using Windows 7 64-bit.
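
As a diagnostic aid for the UnsatisfiedLinkError above, a minimal sketch like the following can show where the JVM searches for native libraries and which load call fails. The class name NativeLoadCheck and the helper tryLoad are hypothetical names used only for illustration; System.loadLibrary and the java.library.path property are standard. A "Can't find dependent libraries" message generally means the named DLL itself was located, but another DLL it links against could not be resolved.

```java
class NativeLoadCheck {
    /** Tries to load a native library by name; returns true on success. */
    static boolean tryLoad(String name) {
        try {
            System.loadLibrary(name);
            return true;
        } catch (UnsatisfiedLinkError e) {
            // The message tells apart "no <name> in java.library.path" from
            // failures that happen while loading a library that was found
            System.err.println("Failed to load '" + name + "': " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Directories the JVM searches for libraries passed to System.loadLibrary
        System.out.println("java.library.path = " + System.getProperty("java.library.path"));
        // Library names taken from the question
        for (String name : new String[] {"tessdll", "tesjeract"}) {
            System.out.println(name + " loaded: " + tryLoad(name));
        }
    }
}
```

Loading each library separately, rather than in one static block, makes it easier to see which of the two DLLs is the one that cannot be resolved.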


2. So how can I convert a BufferedImage into a format tesseract is able to work with?

SOLVED

This is the code, if anyone is interested (it originates from audiveris):

private ByteBuffer imageToTiffBuffer (BufferedImage image) throws IOException
{
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ImageOutputStream     ios = ImageIO.createImageOutputStream(baos);

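    // Take the first suitable TIFF writer
    // (next() throws NoSuchElementException if no TIFF writer is registered)
    ImageWriter writer = ImageIO.getImageWritersByFormatName("tiff").next();
    writer.setOutput(ios);
    writer.write(image);
    ios.close();
    writer.dispose();  // release the writer's resources

    // allocate() doesn't work here, so use a direct buffer
    ByteBuffer buf = ByteBuffer.allocateDirect(baos.size());
    buf.put(baos.toByteArray());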

    return buf;
}
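
For readers who want to try the conversion in isolation, here is a self-contained sketch of the same idea with imports included. It is a variation, not the audiveris code: the class name ImageBuffers, the method toDirectBuffer, and the format parameter are hypothetical, and it uses ImageIO.write, which returns false when no writer is registered, instead of picking a writer by hand. A TIFF writer ships with the JDK only from Java 9 onward, so on older JVMs the "tiff" format needs an extra ImageIO plugin; the demo below uses "png" so that it runs anywhere.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import javax.imageio.ImageIO;

class ImageBuffers {
    /** Encodes the image in the given format and copies the bytes into a direct buffer. */
    static ByteBuffer toDirectBuffer(BufferedImage image, String format) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // ImageIO.write returns false when no writer is registered for the format
        if (!ImageIO.write(image, format, baos)) {
            throw new IOException("No ImageIO writer found for format: " + format);
        }
        // Direct buffers live outside the Java heap, which lets native code address them
        ByteBuffer buf = ByteBuffer.allocateDirect(baos.size());
        buf.put(baos.toByteArray());
        buf.flip(); // reset the position so Java-side readers start at offset 0
        return buf;
    }

    public static void main(String[] args) throws IOException {
        BufferedImage img = new BufferedImage(64, 32, BufferedImage.TYPE_BYTE_GRAY);
        ByteBuffer buf = toDirectBuffer(img, "png");
        System.out.println("direct=" + buf.isDirect() + ", bytes=" + buf.remaining());
    }
}
```

Passing "tiff" as the format gives the TIFF-encoded buffer from the snippet above, provided a TIFF writer is available on the JVM in use.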

Comments (3)

亽野灬性zι浪 2024-12-14 17:14:59

The bitness of the external libraries needs to match that of your JVM. Since tesjeract is the lowest common denominator, you'll need to use the 32-bit JVM.
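
To check which JVM is actually running the application, a small sketch along these lines may help. The class name JvmBitness and the method detect are hypothetical. The sun.arch.data.model property is specific to HotSpot-based JVMs and may be missing elsewhere, so the code falls back to os.arch, which describes the architecture the JVM was built for.

```java
class JvmBitness {
    /** Returns 32 or 64 when the JVM bitness can be determined, or -1 otherwise. */
    static int detect() {
        // HotSpot-specific property; not guaranteed to exist on every JVM
        String model = System.getProperty("sun.arch.data.model");
        if ("32".equals(model)) return 32;
        if ("64".equals(model)) return 64;

        // Fallback: os.arch reflects the JVM's own architecture, e.g. "x86" or "amd64"
        String arch = System.getProperty("os.arch", "");
        if (arch.contains("64")) return 64;
        if (arch.equals("x86") || arch.matches("i[3-6]86")) return 32;
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("os.arch = " + System.getProperty("os.arch"));
        System.out.println("JVM bitness = " + detect());
    }
}
```

If this reports 64 while the DLLs were built as 32-bit binaries, the libraries and the JVM do not match.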

失退 2024-12-14 17:14:59

You could look at audiveris, a Java OMR package that I believe uses Tesseract for the text portions of sheets.

绳情 2024-12-14 17:14:59

There are two Java wrappers for Tesseract 2.04 that you may want to take a look at: Tess4J and Tesjeract.
