How to segment text images using MATLAB?
It's part of the process of OCR, which is:
How do I segment sentences into words, and then words into characters?
What are the candidate algorithms for this task?
As a first pass:
Now all you need is a good-enough definition of "large".
First, NIST (the National Institute of Standards and Technology) published a protocol known as the NIST Form-Based Handwriting Recognition System about 15 years ago for this exact problem, i.e., extracting and preparing text-as-image data for input to machine learning algorithms for OCR. Members of this group at NIST also published a number of papers on this system.
The performance of their classifier was demonstrated by data published along with the algorithm (the "NIST Handwriting Sample Forms").
Each of the half-dozen or so OCR data sets I have downloaded and used references the data extraction/preparation protocol used by NIST to prepare the data for input to their algorithm. In particular, I am fairly sure this is the methodology relied on to prepare the Boston University Handwritten Digit Database, which is regarded as benchmark reference data for OCR.
So even if the NIST protocol is not a genuine standard, it is at least a proven methodology for preparing text-as-image for input to an OCR algorithm. I would suggest starting there, and using that protocol to prepare your data unless you have a good reason not to.
In sum, the NIST data was prepared by extracting 32 x 32 pixel normalized bitmaps directly from a pre-printed form.
Here's an example:
00000000000001100111100000000000
00000000000111111111111111000000
00000000011111111111111111110000
00000000011111111111111111110000
00000000011111111101000001100000
00000000011111110000000000000000
00000000111100000000000000000000
00000001111100000000000000000000
00000001111100011110000000000000
00000001111100011111000000000000
00000001111111111111111000000000
00000001111111111111111000000000
00000001111111111111111110000000
00000001111111111111111100000000
00000001111111100011111110000000
00000001111110000001111110000000
00000001111100000000111110000000
00000001111000000000111110000000
00000000000000000000001111000000
00000000000000000000001111000000
00000000000000000000011110000000
00000000000000000000011110000000
00000000000000000000111110000000
00000000000000000001111100000000
00000000001110000001111100000000
00000000001110000011111100000000
00000000001111101111111000000000
00000000011111111111100000000000
00000000011111111111000000000000
00000000011111111110000000000000
00000000001111111000000000000000
00000000000010000000000000000000
I believe that the BU data-prep technique subsumes the NIST technique but adds a few steps at the end, not with higher fidelity in mind but to reduce file size. In particular, the BU group divided each 32 x 32 bitmap into non-overlapping blocks of 4x4, counted the activated pixels in each block ("1" is activated; "0" is not), and stored the result as an 8x8 matrix in which each element is an integer (0-16).
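The block-counting step described above is easy to reproduce. Here is a minimal sketch (mine, not the BU group's code; the function name is my own) that partitions a binary bitmap into non-overlapping 4x4 blocks and counts the "1" pixels in each, yielding integers in 0-16:

```python
# Sketch of the BU-style downsampling: partition a binary bitmap
# (given as equal-length strings of '0'/'1') into non-overlapping
# 4x4 blocks and count the activated pixels in each block.

def block_counts(bitmap, block=4):
    """Return a (rows/block) x (cols/block) matrix of pixel counts."""
    rows, cols = len(bitmap), len(bitmap[0])
    counts = []
    for r in range(0, rows, block):
        row_counts = []
        for c in range(0, cols, block):
            # Count '1' pixels inside the block anchored at (r, c).
            total = sum(bitmap[r + i][c + j] == "1"
                        for i in range(block) for j in range(block))
            row_counts.append(total)
        counts.append(row_counts)
    return counts

# Tiny 8x8 example: an all-ones top-left 4x4 block, rest zeros.
demo = ["11110000"] * 4 + ["00000000"] * 4
print(block_counts(demo))  # [[16, 0], [0, 0]]
```

Applied to a 32 x 32 NIST bitmap like the one shown above, this yields the 8x8 integer matrix the BU group stored.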
For finding a binary sequence like 101000000000000000010000001, detect the sequences 0000, 0001, 001, 01, 1.
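This hint is terse; one reading (my interpretation, not necessarily the answerer's) is that the five patterns 0000, 0001, 001, 01, 1 form a prefix-free code, so a binary string can be decomposed by greedy longest-match scanning:

```python
# One interpretation of the hint above: the patterns form a prefix-free
# code, so greedy longest-match tokenization decomposes any binary string
# into runs of at most three leading zeros followed by a terminator.

PATTERNS = ["0000", "0001", "001", "01", "1"]  # longest first

def tokenize(bits):
    tokens, i = [], 0
    while i < len(bits):
        for p in PATTERNS:
            if bits.startswith(p, i):
                tokens.append(p)
                i += len(p)
                break
        else:
            raise ValueError("no pattern matches at position %d" % i)
    return tokens

print(tokenize("101000000000000000010000001"))
# ['1', '01', '0000', '0000', '0000', '0000', '1', '0000', '001']
```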
I am assuming you are using the Image Processing Toolbox in MATLAB.
To distinguish text in an image, you might want to follow these steps:
With trial and error, you'll find the proper coefficients such that the image you obtain after the 5th step contains convex regions surrounding each letter/word/line/paragraph.
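As a complement to the toolbox approach (the answer's own steps are not shown here), one classic candidate algorithm for this segmentation task is projection profiles: columns whose pixel sum is zero separate characters/words, and rows whose sum is zero separate lines. A minimal sketch (mine, in Python rather than MATLAB for illustration):

```python
# Projection-profile segmentation sketch: find runs of nonzero sums
# in a column (or row) profile of a binary image; each run is one
# character/word (or line) segment.

def segments(profile):
    """Return (start, end) index pairs of runs where profile > 0."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

# Toy binary image: two "characters" separated by a blank column gap.
img = [
    "1100110",
    "1100110",
    "0000000",
]
col_profile = [sum(row[c] == "1" for row in img) for c in range(len(img[0]))]
print(segments(col_profile))  # [(0, 2), (4, 6)]
```

The same function applied to a row profile splits the page into lines; applying it hierarchically (lines, then words, then characters) mirrors the segmentation asked about in the question.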
NOTE:
Check out the "Examples in Documentation" section in the online docs, or refer to the Image Processing Toolbox documentation in the MATLAB Help menu.
The examples given there will guide you to the proper functions to call and their various calling formats.
Sample code (not mine):