Java OCR library: compiling tesseract on Windows 64-bit

Posted on 2024-12-07 17:14:59

I'm using ImageJ for processing document images (business documents) and I am looking for a good OCR library to retrieve text from some regions.

Currently I am using Asprise, but the results aren't very reliable. Certain characters often get confused (0 becomes O, 8 -> B, % -> 0, ...), and whitespace appears where it isn't supposed to be, which causes a lot of trouble when postprocessing the data. The images have a resolution of 1240x1754. I haven't tried a higher resolution yet, but the smallest characters I'd like to detect are 15 pixels high, so I assume the image quality is sufficient. (By the way, I am performing the OCR on the original image, not on a binarized one.)
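As an illustration of the postprocessing problem described above, here is a minimal sketch of a cleanup step for fields that are known in advance to be purely numeric. The class and method names are hypothetical, and the mapping only covers the O/0 and B/8 confusions plus stray whitespace; a confusion like % -> 0 cannot be undone this way, and applying this to free text would corrupt it.

```java
final class NumericFieldCleanup {

    /**
     * Cleans up an OCR result for a field that is expected to contain only digits.
     * Assumption: the caller already knows the field is numeric.
     */
    static String cleanNumericField(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (char c : raw.toCharArray()) {
            if (Character.isWhitespace(c)) {
                continue; // drop whitespace the OCR engine inserted
            }
            switch (c) {
                case 'O': out.append('0'); break; // letter O read instead of zero
                case 'B': out.append('8'); break; // letter B read instead of eight
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(cleanNumericField("1O B4 7")); // prints 10847
    }
}
```

This only hides the symptom for numeric fields; it does not make the recognition itself more reliable.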

Looking at similar questions here, I noticed that tesseract is often recommended. Since it's written in C++, I am not sure how I can use it from Java and ImageJ.

With Asprise, which as far as I know is also written in C++ and just offers a Java wrapper, I can perform OCR on a BufferedImage. So I assume I could do the same with tesseract.

1. How can I call tesseract functions from Java?
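One option that avoids native bindings entirely is to run the tesseract command-line program as a child process. The sketch below assumes a `tesseract` executable is on the PATH and relies on its `tesseract <image> <outputbase>` form, which writes the recognized text to `<outputbase>.txt`. The class and method names are hypothetical, and the output is assumed to be UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

final class TesseractCli {

    /** Builds the command line: tesseract <image> <outputbase>. */
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    /** Runs tesseract on the image and returns the contents of <outputbase>.txt. */
    static String recognize(Path image, Path outputBase)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase))
                .inheritIO() // forward tesseract's console output to this process
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        Path textFile = Paths.get(outputBase.toString() + ".txt");
        return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2) {
            System.out.println(recognize(Paths.get(args[0]), Paths.get(args[1])));
        } else {
            System.out.println(buildCommand(Paths.get("page.tif"), Paths.get("page")));
        }
    }
}
```

The trade-off is that each region has to be written to an image file first and every call starts a new process, so this is slower than an in-process wrapper but sidesteps DLL loading problems.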

UPDATE: I tried to use tesjeract, but when I run my application it crashes with the following error:

UnsatisfiedLinkError: C:\Windows\System32\tessdll.dll: Can't find dependent libraries

I was able to compile tesjeract and tesseract 2.04 successfully and placed tessdll.dll and tesjeract.dll in C:\Windows\System32. I am using this static block to load the libraries:

static
{
    System.loadLibrary("tessdll");
    System.loadLibrary("tesjeract");
}

In case it is relevant, I am using Windows 7 64-bit.
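The error message suggests that tessdll.dll itself was found, but that something it links against could not be loaded. On 64-bit Windows, two things seem worth checking: whether the JVM and the DLLs have the same bitness, and where the DLLs are actually visible (a 32-bit process sees C:\Windows\System32 redirected to C:\Windows\SysWOW64). The sketch below only gathers that information; the class and method names are hypothetical, and `sun.arch.data.model` is a vendor-specific property that may be absent on some JVMs.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class NativeLibDiagnostics {

    /** Returns the directories in pathList that contain a regular file named fileName. */
    static List<String> findOnPath(String fileName, String pathList) {
        List<String> hits = new ArrayList<>();
        for (String dir : pathList.split(Pattern.quote(File.pathSeparator))) {
            if (!dir.isEmpty() && new File(dir, fileName).isFile()) {
                hits.add(dir);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // A 64-bit JVM cannot load 32-bit DLLs, and vice versa
        System.out.println("os.arch = " + System.getProperty("os.arch"));
        System.out.println("sun.arch.data.model = "
                + System.getProperty("sun.arch.data.model"));

        // System.loadLibrary searches the directories in java.library.path
        String libraryPath = System.getProperty("java.library.path", "");
        for (String name : new String[] {"tessdll", "tesjeract"}) {
            String fileName = System.mapLibraryName(name); // e.g. tessdll.dll on Windows
            System.out.println(fileName + " found in: " + findOnPath(fileName, libraryPath));
        }
    }
}
```

If the bitness matches and both files are found, the missing piece is probably a further dependency of tessdll.dll, such as a compiler runtime DLL; a tool that lists a DLL's imports can show which one.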

2. So how can I convert a BufferedImage into a format tesseract is able to work with?

SOLVED

Here is the code, in case anyone is interested (it originates from audiveris):

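import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

// Encodes the image as TIFF in memory and returns the bytes in a direct ByteBuffer
private ByteBuffer imageToTiffBuffer (BufferedImage image) throws IOException
{
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ImageOutputStream     ios = ImageIO.createImageOutputStream(baos);

    // Take the first suitable TIFF writer
    ImageWriter writer = ImageIO.getImageWritersByFormatName("tiff").next();
    writer.setOutput(ios);
    writer.write(image);
    ios.close();
    writer.dispose(); // release resources held by the writer

    // allocate() doesn't work (presumably the native side needs a direct buffer)
    ByteBuffer buf = ByteBuffer.allocateDirect(baos.size());
    buf.put(baos.toByteArray());

    return buf;
}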
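One caveat with imageToTiffBuffer: `getImageWritersByFormatName("tiff").next()` throws a NoSuchElementException when no TIFF writer is registered. The standard ImageIO only ships a TIFF plugin from Java 9 onward; older runtimes need an extra plugin such as JAI Image I/O Tools on the classpath. A small guard, sketched below with hypothetical names, turns this into a clearer error.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;

final class WriterCheck {

    /** Returns true if ImageIO has at least one writer for the given format name. */
    static boolean hasWriter(String formatName) {
        return ImageIO.getImageWritersByFormatName(formatName).hasNext();
    }

    /** Returns the first writer for the format, or fails with a descriptive message. */
    static ImageWriter requireWriter(String formatName) {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName(formatName);
        if (!writers.hasNext()) {
            throw new IllegalStateException(
                    "No ImageIO writer registered for format: " + formatName);
        }
        return writers.next();
    }

    public static void main(String[] args) {
        System.out.println("TIFF writer available: " + hasWriter("tiff"));
    }
}
```

Calling `requireWriter("tiff")` in place of the bare `next()` call keeps the rest of the method unchanged.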
