Separating background and foreground layers in a scanned document

Posted 2024-10-05 09:58:21

I need to automatically remove the mildly colored background of a scanned document image for OCR.

ScanTailor is an open source C++ GUI-based app that does background separation among other things, but I cannot figure out how to run only the last step which actually removes the background.

Ideally, I could find the code that does this and either:

  1. Port that part to C#
  2. Modify the C++ to respond to command line execution, only performing that step on a given image

Can you help me understand how to do either?
Or do you know of other libraries that can do this? (Any language/platform is acceptable.)

Comments (2)

塔塔猫 2024-10-12 09:58:21

You are referring to thresholding, despeckling and noise removal, techniques that are necessary in OCR applications.

The quality of the results depends very much on many different factors:

Print quality of the original
Scan quality
Image resolution
Background colours and patterns used.
Noise and other marks.

You may find the IEvolution.NET library at http://www.hi-components.com/nievolution.asp useful. It has many image processing functions to play with.

There are many commercial engines available. There is no one perfect function that solves every image processing problem; you must adapt the functions and parameters to match your images. See http://www.recogniform.com/thresholding.htm
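To make the terms above concrete, here is a minimal sketch of global thresholding followed by a very simple despeckle pass, in plain Python. It is not taken from either library mentioned above. It assumes the page is already an 8-bit grayscale image held as a list of rows, uses Otsu's method as one common way to pick the threshold (a technique chosen here for illustration), and the function names `otsu_threshold`, `binarize` and `despeckle` are made up for this example.

```python
def otsu_threshold(gray):
    """Pick the threshold that maximises between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t (dark class)
    sum0 = 0  # sum of the values in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(gray, threshold):
    """Map pixels at or below the threshold to ink (0), the rest to paper (255)."""
    return [[0 if v <= threshold else 255 for v in row] for row in gray]


def despeckle(binary):
    """Turn ink pixels that have no ink among their 8 neighbours into paper."""
    height, width = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 0:
                continue
            has_ink_neighbour = any(
                binary[ny][nx] == 0
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_ink_neighbour:
                out[y][x] = 255
    return out


# Tiny synthetic page: light paper (220), a three-pixel stroke and one stray speck (30).
page = [[220] * 5 for _ in range(5)]
for x in (1, 2, 3):
    page[1][x] = 30
page[4][0] = 30

t = otsu_threshold(page)
clean = despeckle(binarize(page, t))
```

A single global threshold tends to struggle on unevenly lit scans or patterned backgrounds, which is where the tuning mentioned above comes in; a locally adaptive threshold is the usual next thing to try.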

A Google search will show up lots of results.

痕至 2024-10-12 09:58:21

Maybe the algorithm is, approximately:

  • Decide what the background color is
  • Scan the bitmap for pixels whose color is (and/or is sufficiently similar to) the background color
  • Convert these pixels to white or transparent
  • Possibly (especially if the page contains images and not just text) ignore isolated pixels that have the background color but are not adjacent to any other background pixels

If it's a high-resolution low-color-depth (e.g. black-and-white) image, then you need to apply this algorithm to groups of pixels.
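The steps above could be sketched roughly as follows in plain Python. This is only an illustration under stated assumptions, not ScanTailor's actual method: the image is assumed to be a list of rows of (R, G, B) tuples, the background is guessed as the most common coarsely quantised color, and the names `estimate_background`, `remove_background`, `tolerance` and `min_neighbours` are invented for this example.

```python
from collections import Counter

WHITE = (255, 255, 255)


def estimate_background(pixels, step=16):
    """Guess the background as the most common colour after coarse quantisation."""
    buckets = Counter()
    for row in pixels:
        for r, g, b in row:
            buckets[(r // step, g // step, b // step)] += 1
    (qr, qg, qb), _ = buckets.most_common(1)[0]
    half = step // 2
    # Return the centre of the winning bucket as the representative colour.
    return (qr * step + half, qg * step + half, qb * step + half)


def is_similar(c1, c2, tolerance):
    """True if two RGB colours are within `tolerance` in Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) <= tolerance ** 2


def remove_background(pixels, tolerance=40, min_neighbours=1):
    """Whiten background-like pixels, leaving isolated ones untouched."""
    background = estimate_background(pixels)
    height, width = len(pixels), len(pixels[0])
    mask = [[is_similar(p, background, tolerance) for p in row] for row in pixels]

    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            keep = True
            if mask[y][x]:
                # Count background-like pixels among the 8 neighbours.
                neighbours = sum(
                    mask[ny][nx]
                    for ny in range(max(0, y - 1), min(height, y + 2))
                    for nx in range(max(0, x - 1), min(width, x + 2))
                    if (ny, nx) != (y, x)
                )
                keep = neighbours < min_neighbours
            out_row.append(pixels[y][x] if keep else WHITE)
        result.append(out_row)
    return result


# Synthetic example: cream paper, a dark 3x3 block with one cream pixel inside it.
PAPER, INK = (240, 230, 200), (20, 20, 20)
image = [[PAPER] * 7 for _ in range(7)]
for y in range(2, 5):
    for x in range(2, 5):
        image[y][x] = INK
image[3][3] = PAPER  # isolated background-coloured pixel inside the dark block

cleaned = remove_background(image)
```

Both `tolerance` and `min_neighbours` would need tuning per scan. For the high-resolution, low-color-depth case mentioned above, the same idea would have to be applied to averaged blocks of pixels rather than to single pixels.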
