Image processing / ultra-light OCR

Posted on 2024-07-29 19:58:19

I have 55 000 image files (in both JPG and TIFF format) which are pictures from a book.

The structure of each page is this:

some text

--- (horizontal line) ---

a number

some text

--- (horizontal line) ---

another number

some text

There can be from zero to 4 horizontal lines on any given page.

I need to find what the number is, just below the horizontal line.

However, the numbers are strictly consecutive, starting at 1 on page one, so I don't need to read them to know what they are: I could just detect the presence of horizontal lines, which should be both easier and safer than trying to OCR the page to detect the numbers.

The algorithm would be, basically:

for each image
  count horizontal lines
  print image name, number of horizontal lines
  next image
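
Since the numbers are consecutive, the per-page line counts are enough to recover which numbers sit on which page. A minimal Python sketch of that bookkeeping, assuming the counts are already known and are given in page order (the function name `numbers_on_pages` is made up for illustration):

```python
def numbers_on_pages(line_counts, first_number=1):
    """Map per-page line counts to the numbers printed on each page.

    line_counts: number of horizontal lines found on each page, in page order.
    Returns one list of numbers per page (empty if the page has no lines).
    """
    next_number = first_number
    pages = []
    for count in line_counts:
        # Each horizontal line is followed by exactly one number.
        pages.append(list(range(next_number, next_number + count)))
        next_number += count
    return pages
```

A single miscounted page shifts every later number, so it may be worth checking the final total against the last number in the book.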

The question is: what would be the best image library/language to do the "count horizontal lines" part?

南…巷孤猫 2024-08-05 19:58:19

Probably the easiest way to detect your lines is using the Hough transform in OpenCV (which has wrappers for many languages).

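The OpenCV Hough transform detects the straight lines in the image. The standard variant (HoughLines) returns each line as a (rho, theta) pair, while the probabilistic variant (HoughLinesP) returns the start and end coordinates of each segment. You should only keep the lines whose angle is close to horizontal and whose length is adequate.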

O'Reilly's Learning OpenCV explains in detail the function's input and output (p.156).

唔猫 2024-08-05 19:58:19

If you have good contrast, try running connected components and analyze the result. It can be an alternative to finding lines through Hough and cover the case when your structured elements are a bit curved or a line algorithm picks up the lines you don’t want it to pick up.

Connected components is a very fast algorithm that needs only two raster scans. It gives you a mask in which every connected element is marked with its own label and accounted for. You can discard anything that is too short (in terms of aspect ratio). Overall, this can be more general and faster than running the Hough transform, but it is probably a bit more involved. The Hough transform, on the other hand, is more tolerant of contrast artifacts and even of accidental gaps in the lines.
OpenCV has the function findContours(), which finds the components for you.
