Is there an efficient algorithm for segmentation of handwritten text?
I want to automatically divide an image of ancient handwritten text by lines (and by words in future).
The first obvious part is preprocessing the image...
I'm just using simple binarization (based on pixel brightness). After that I store the data in a two-dimensional array.
The next obvious part is analyzing the binary array.
My first algorithm was pretty simple: if there are more black pixels in a row of the array than the root mean square of the maximum and minimum values, then this row is part of a line.
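A minimal sketch of this row-threshold idea in Python (the list-of-lists binary image and the exact RMS formula are my assumptions based on the description):

```python
import math

def text_rows(bits):
    """Mark rows whose black-pixel count exceeds the RMS of the
    minimum and maximum per-row counts (sketch of the first attempt)."""
    counts = [sum(row) for row in bits]                # black pixels per row
    lo, hi = min(counts), max(counts)
    threshold = math.sqrt((lo * lo + hi * hi) / 2)    # root mean square
    return [c > threshold for c in counts]

# Toy example: two dense "text" rows surrounded by blank rows.
image = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
print(text_rows(image))  # -> [False, True, True, False]
```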
After forming the list of lines I cut off lines whose height is less than average.
Finally it turned out to be a kind of linear regression, trying to minimize the difference between the blank rows and the text rows. (I assumed that fact.)
My second attempt: I tried to use a GA with several fitness functions.
The chromosome contained 3 values: xo, x1, x2, with xo ∈ [-1;0], x1 ∈ [0;0.5], x2 ∈ [0;0.5].
The function that determines whether a row belongs to a line is (xo + α1·x1 + α2·x2) > 0, where α1 is the scaled sum of black pixels in the row and α2 is the median of the ranges between the extreme black pixels in the row (α1, α2 ∈ [0,1]).
Other functions that I tried are (x1 < α1 OR x2 > α2) and (1/xo + (α1·x1)/(α2·x2)) > 0.
The last function is the most efficient.
The fitness function is
1 / (HeightRange + SpacesRange)
where a range is the difference between the maximum and the minimum. It represents the homogeneity of the text. The global optimum of this function is the smoothest way to divide the image into lines.
I am using C# with my self-coded GA (classical, with 2-point crossover, Gray-code chromosomes, maximum population of 40, mutation rate of 0.05).
Now I have run out of ideas for how to divide this image into lines with ~100% accuracy.
What is an efficient algorithm to do this?
UPDATE:
Original BMP (1.3 MB)
UPDATE2:
Improved the results on this text to 100%
How I did it:
- fixed a minor bug in the range count
- changed the fitness function to 1 / ((distancesRange + 1) * (heightsRange + 1))
- simplified the classifying function to (1/xo + x2/range) > 0 (the number of points in a row no longer affects classification)
(i.e. optimized input data and made fitness function optimizations more explicit)
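A sketch of the updated fitness function, assuming a candidate segmentation is summarized by its line heights and inter-line distances (the parameter names are mine, not the author's):

```python
def fitness(heights, distances):
    """1 / ((distancesRange + 1) * (heightsRange + 1)):
    largest when line heights and inter-line gaps are homogeneous."""
    heights_range = max(heights) - min(heights)
    distances_range = max(distances) - min(distances)
    return 1.0 / ((distances_range + 1) * (heights_range + 1))

# A perfectly regular segmentation scores the maximum, 1.0.
print(fitness([10, 10, 10], [4, 4]))  # -> 1.0
# Irregular heights and gaps score lower.
print(fitness([8, 12, 10], [3, 6]))   # -> 0.05
```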
Problem:
Surprisingly, the GA failed to recognize this line. I looked at the debug data of the 'find ranges' function and found that there is too much noise in the 'unrecognized' place.
The function code is below:
public double[] Ranges()
{
    var ranges = new double[_original.Height];
    for (int y = 0; y < _original.Height; y++)
    {
        ranges[y] = 0;
        var dx = new List<int>();

        // Find the first black pixel in this row.
        int last = 0;
        int x = 0;
        while (last == 0 && x < _original.Width)
        {
            if (_bit[x, y])
                last = x;
            x++;
        }
        if (last == 0)
        {
            ranges[y] = 0; // no black pixels in this row
            continue;
        }

        // Collect the distances between consecutive black pixels.
        for (x = last; x < _original.Width; x++)
        {
            if (!_bit[x, y]) continue;
            if (last != x - 1)
            {
                dx.Add((x - last) + 1);
            }
            last = x;
        }

        // Use the median distance; rows with too few gaps get 0.
        if (dx.Count > 2)
        {
            dx.Sort();
            ranges[y] = dx[dx.Count / 2];
            //ranges[y] = dx.Average();
        }
        else
        {
            ranges[y] = 0;
        }
    }

    // Rows with no measurable gap are treated as maximally sparse.
    var maximum = ranges.Max();
    for (int i = 0; i < ranges.Length; i++)
    {
        if (Math.Abs(ranges[i] - 0) < 0.9)
            ranges[i] = maximum;
    }
    return ranges;
}
I'm using some hacks in this code. The main reason is that I want to minimize the range between the nearest black pixels, but if there are no pixels, the value becomes '0', and it becomes impossible to solve the problem by finding optima. The second reason is that this code changes too frequently.
I'll try to fully change this code, but I have no idea how to do it.
Q:
- Is there a more efficient fitness function?
- How can I find a more versatile determination function?
Comments (3)
Although I'm not sure how to translate the following algorithm into GA (and I'm not sure why you need to use GA for this problem), and I could be off base in proposing it, here goes.
The simple technique I would propose is to count the number of black pixels per row. (Actually it's the dark pixel density per row.) This requires very few operations, and with a few additional calculations it's not difficult to find peaks in the pixel-sum histogram.
A raw histogram will look something like this, where the profile along the left side shows the number of dark pixels in a row. For visibility, the actual count is normalized to stretch out to x = 200.
After some additional, simple processing is added (described below), we can generate a histogram like this that can be clipped at some threshold value. What remains are peaks indicating the center of lines of text.
From there it's a simple matter to find the lines: just clip (threshold) the histogram at some value such as 1/2 or 2/3 the maximum, and optionally check that the width of the peak at your clipping threshold is some minimum value w.
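A sketch of this projection-profile approach in Python (a plain nested list stands in for the binary image; the clipping fraction is the free parameter mentioned above):

```python
def find_line_bands(bits, frac=0.5):
    """Clip the row histogram at frac * max and return (start, end)
    row-index ranges of the surviving peaks, i.e. the text lines."""
    hist = [sum(row) for row in bits]   # dark pixels per row
    threshold = frac * max(hist)
    bands, start = [], None
    for y, h in enumerate(hist):
        if h > threshold and start is None:
            start = y                   # a peak begins
        elif h <= threshold and start is not None:
            bands.append((start, y - 1))  # the peak ends
            start = None
    if start is not None:
        bands.append((start, len(hist) - 1))
    return bands

image = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
print(find_line_bands(image))  # -> [(1, 2), (4, 4)]
```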
One implementation of the full (yet still simple!) algorithm to find the nicer histogram is as follows:
The "vertical count" (step 3) eliminates horizontal strokes that happen to be located above or below the center line of text. A more sophisticated algorithm would just check directly above and below (x,y), but also to the upper left, upper right, lower left, and lower right.
With my rather crude implementation in C# I was able to process the image in less than 75 milliseconds. In C++, and with some basic optimization, I've little doubt the time could be cut down considerably.
This histogram method assumes the text is horizontal. Since the algorithm is reasonably fast, you may have enough time to calculate pixel count histograms at increments of every 5 degrees from the horizontal. The scan orientation with the greatest peak/valley differences would indicate the rotation.
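A rough sketch of this angle search, approximating rotation by a shear of the row index and scoring peak sharpness by the sum of squared bin counts (both the shear and the score are my simplifications, not the answer's exact method):

```python
import math

def projection_score(bits, theta_deg):
    """Project black pixels onto sheared rows (an approximation of
    rotating by theta degrees) and score how peaky the profile is:
    the sum of squared bin counts is largest when pixels concentrate
    into few bins, i.e. when text lines align with the scan rows."""
    tan_t = math.tan(math.radians(theta_deg))
    hist = {}
    for y, row in enumerate(bits):
        for x, px in enumerate(row):
            if px:
                b = round(y + x * tan_t)  # sheared row index
                hist[b] = hist.get(b, 0) + 1
    return sum(c * c for c in hist.values())

def estimate_skew(bits, max_deg=10, step=5):
    """Return the tilt angle (degrees) with the sharpest projection."""
    angles = range(-max_deg, max_deg + 1, step)
    return max(angles, key=lambda a: projection_score(bits, a))

# Two perfectly horizontal black lines: best angle is 0 degrees.
page = [[0] * 24, [1] * 24, [0] * 24, [1] * 24, [0] * 24]
print(estimate_skew(page))  # -> 0
```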
I'm not familiar with GA terminology, but if what I've suggested is of some value I'm sure you can translate it into GA terms. In any case, I was interested in this problem anyway, so I might as well share.
EDIT: maybe for use with a GA, it's better to think in terms of "distance since the previous dark pixel in X" (or along angle theta) and "distance since the previous dark pixel in Y" (or along angle [theta - pi/2]). You might also check the distance from white pixel to dark pixel in all radial directions (to find loops).
After fiddling around with this for a while I found that I simply need to count the number of crossings for each row: a switch from white to black counts as one, and a switch from black to white increments the count by one again. By highlighting each row with a count > 66 I got close to 100% accuracy, except for the bottommost line.
Of course, this would not be robust to slightly rotated scanned documents, and it has the disadvantage of needing to determine the correct threshold.
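A minimal sketch of this crossing-count rule (the threshold of 66 comes from the answer; the list-of-lists image layout is my assumption):

```python
def crossings(row):
    """Count transitions between white (0) and black (1) in one row."""
    return sum(1 for a, b in zip(row, row[1:]) if a != b)

def rows_by_crossings(bits, threshold=66):
    """Mark rows whose transition count exceeds the threshold as text."""
    return [crossings(row) > threshold for row in bits]

print(crossings([0, 1, 1, 0, 0, 1, 0]))  # -> 4
```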
Idea 1:
Make your own version of ReCaptcha (to put on your very own pron site) - and make it a fun game.. "Like cut out a word (edges should all be white space - with some tolerance for overlapping chars on above and below lines)."
Idea 2:
This was a game we played as kids, the wire of a coat hanger was all bent in waves and connected to a buzzer and you had to navigate a wand with a ring in the end with the wire through it, across one side to the other without making the buzzer go off. Perhaps you could adapt this idea and make a mobile game where people trace out the lines without touching black text (with tolerance for overlapping chars)... when they can do a line they get points and get to new levels where you give them harder images..
Idea 3:
Research how google/recaptcha got around it
Idea 4:
Get the Photoshop SDK and master the functionality of its Extract Edges tool.
Idea 5:
Stretch the image a lot on the Y axis (which should help), apply the algorithm, then scale the location measurements back down and apply them to the normal-sized image.