How can I segment a text image with MATLAB?

Published 2024-08-28

It's part of the OCR process, namely:

How do I segment sentences into words, and then words into characters?

What are the candidate algorithms for this task?


Comments (4)

薯片软お妹 2024-09-04 00:50:58


As a first pass:

  • process the text into lines
  • process each line into segments (connected components)
  • find the largest white band that can be placed between each pair of segments
  • look at the sequence of widths and select the "large" widths as whitespace
  • everything between whitespace is a word

Now all you need is a good enough definition of "large".
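The steps above can be sketched in plain Python (the question is about MATLAB, but the logic is language-agnostic; the names `column_profile` and `split_words` are my own, not from any library):

```python
# Gap-based word segmentation on one text line, as described above.
# A "line image" is a list of rows of 0/1 pixels (1 = ink).

def column_profile(line_img):
    """Number of ink pixels in each column."""
    return [sum(col) for col in zip(*line_img)]

def split_words(line_img, large=3):
    """Return (start, end) column ranges of words.

    A gap is a maximal run of blank columns; gaps of width >= `large`
    separate words, while narrower gaps are intra-word spacing.
    """
    profile = column_profile(line_img)
    words, start, blanks = [], None, 0
    for x, ink in enumerate(profile):
        if ink:
            if start is None:
                start = x
            elif blanks >= large:          # the gap just crossed was "large"
                words.append((start, x - blanks))
                start = x
            blanks = 0
        elif start is not None:
            blanks += 1
    if start is not None:
        words.append((start, len(profile) - blanks))
    return words
```

Tuning `large` is exactly the "good enough definition" problem: one common heuristic is to histogram the gap widths and split them into two clusters (letter gaps vs. word gaps).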

柠檬色的秋千 2024-09-04 00:50:58


First, NIST (the National Institute of Standards and Technology) published a protocol known as the NIST Form-Based Handwriting Recognition System about 15 years ago for this exact question, i.e., extracting and preparing text-as-image data for input to machine learning algorithms for OCR. Members of this group at NIST also published a number of papers on this System.

The performance of their classifier was demonstrated by data also published with the algorithm (the "NIST Handwriting Sample Forms").

Each of the half-dozen or so OCR data sets I have downloaded and used has referenced the data extraction/preparation protocol used by NIST to prepare the data for input to their algorithm. In particular, I am pretty sure this is the methodology relied on to prepare the Boston University Handwritten Digit Database, which is regarded as benchmark reference data for OCR.

So even if the NIST protocol is not a genuine standard, at least it's a proven methodology for preparing text-as-image for input to an OCR algorithm. I would suggest starting there, and using that protocol to prepare your data unless you have a good reason not to.

In sum, the NIST data was prepared by extracting 32 x 32 pixel normalized bitmaps directly from a pre-printed form.

Here's an example:

00000000000001100111100000000000
00000000000111111111111111000000
00000000011111111111111111110000
00000000011111111111111111110000
00000000011111111101000001100000
00000000011111110000000000000000
00000000111100000000000000000000
00000001111100000000000000000000
00000001111100011110000000000000
00000001111100011111000000000000
00000001111111111111111000000000
00000001111111111111111000000000
00000001111111111111111110000000
00000001111111111111111100000000
00000001111111100011111110000000
00000001111110000001111110000000
00000001111100000000111110000000
00000001111000000000111110000000
00000000000000000000001111000000
00000000000000000000001111000000
00000000000000000000011110000000
00000000000000000000011110000000
00000000000000000000111110000000
00000000000000000001111100000000
00000000001110000001111100000000
00000000001110000011111100000000
00000000001111101111111000000000
00000000011111111111100000000000
00000000011111111111000000000000
00000000011111111110000000000000
00000000001111111000000000000000
00000000000010000000000000000000

I believe that the BU data-prep technique subsumes the NIST technique but adds a few steps at the end, not with higher fidelity in mind but to reduce file size. In particular, the BU group:

  • began with the 32 x 32 bitmaps; then
  • divided each 32 x 32 bitmap into non-overlapping 4 x 4 blocks;
  • next, counted the number of activated pixels in each block ("1" is activated, "0" is not);
  • the result is an 8 x 8 input matrix in which each element is an integer (0-16).

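That 4 x 4 block-count reduction is simple to sketch in plain Python (the function name `block_counts` is my own; rows are strings of '0'/'1' as in the bitmap above):

```python
# Collapse a square 0/1 bitmap into block-wise ink counts, as the BU
# group did: a 32x32 bitmap with 4x4 blocks yields an 8x8 matrix of
# integers in 0..16.

def block_counts(bitmap, block=4):
    """Reduce an n x n 0/1 bitmap to (n/block) x (n/block) counts."""
    n = len(bitmap)
    assert all(len(row) == n for row in bitmap) and n % block == 0
    out = []
    for by in range(0, n, block):
        row = []
        for bx in range(0, n, block):
            count = sum(int(bitmap[y][x])
                        for y in range(by, by + block)
                        for x in range(bx, bx + block))
            row.append(count)
        out.append(row)
    return out
```

Feeding the 32-row sample bitmap above through `block_counts` produces the 8 x 8 integer matrix described, at 1/16 of the original pixel count.
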
桃酥萝莉 2024-09-04 00:50:58


For finding a binary sequence like 101000000000000000010000001,
detect the run patterns 0000, 0001, 001, 01, 1.
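One way to read this hint: run-length encode the projection bitstring so that wide runs of zeros (word gaps) stand out from narrow ones (letter gaps). A plain-Python sketch (my own naming, not from any library):

```python
# Run-length encode a '0'/'1' string; long zero-runs mark word gaps.

def runs(bits):
    """Return (symbol, length) pairs for each maximal run in `bits`."""
    out = []
    for b in bits:
        if out and out[-1][0] == b:
            out[-1] = (b, out[-1][1] + 1)
        else:
            out.append((b, 1))
    return out
```

A threshold on the zero-run lengths then separates the "0000"-style gaps (word boundaries) from the "0"/"00" gaps inside a word.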

删除会话 2024-09-04 00:50:58


I am assuming you are using the Image Processing Toolbox in MATLAB.

To distinguish text in an image, you might want to follow these steps:

  1. Grayscale conversion (speeds things up greatly).
  2. Contrast enhancement.
  3. Erode the image lightly to remove noise (scratches/blips).
  4. Dilation (heavy).
  5. Edge detection (or ROI calculation).

With trial and error, you'll get the proper coefficients such that the image you obtain after the fifth step contains convex regions surrounding each letter/word/line/paragraph.
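In MATLAB the erosion/dilation steps are `imerode`/`imdilate` with a structuring element from `strel`. As a language-agnostic illustration of why heavier dilation merges letters into word blobs, here is one pass of 3 x 3 binary dilation in plain Python (my own helper, not toolbox code):

```python
# One pass of 3x3 binary dilation over a list-of-lists 0/1 image: each
# ink region grows by one pixel per pass, so repeated passes close the
# small gaps between letters before the larger gaps between words.

def dilate(img):
    """Dilate a 0/1 image with a 3x3 structuring element."""
    h, w = len(img), len(img[0])
    return [[1 if any(img[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                      if 0 <= y + dy < h and 0 <= x + dx < w)
             else 0
             for x in range(w)]
            for y in range(h)]

img = [[0, 1, 0, 0, 0, 1, 0]]   # two "letters" three blank pixels apart
once = dilate(img)               # gap narrows to one pixel
twice = dilate(once)             # the two blobs have merged into one
```

This is the knob the answer refers to: more passes (or a larger structuring element) merge progressively larger units, from letters up to paragraphs.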

NOTE:

  1. Essentially, the more you dilate, the larger the elements you get; i.e., the least dilation would be useful in identifying letters, whereas comparatively heavy dilation would be needed to identify lines and paragraphs.
  2. Online ImgProc MATLAB docs

Check out the "Examples in Documentation" section in the online docs, or refer to the Image Processing Toolbox documentation in the MATLAB Help menu.

The examples given there will guide you to the proper functions to call and their various formats.

Sample CODE (not mine)
