Detecting text columns from word positions
I have a tiff file and the text on it, which has been OCR'd at an earlier stage. The words have their exact positions as information (upper left, lower right). I now need to read the text within a user-drawn rectangle.
Normal paragraphs are no problem, but I don't know how I should handle text columns. If there are two paragraphs next to each other, simply taking the row as a single line would make the result unusable.
Are there algorithms to help me put the words in the right order? I'm guessing that I have to examine the spaces between words to detect patterns that identify columns. I would like to avoid processing the image directly, although it should be possible (but no OCR).
I am also unsure about the influence of lists/tables, e.g. in orders & bills. A line-orientated approach would probably be better here.
I am developing in Delphi, but adaptable algorithms in other languages would also be appreciated.
Edit: I will try to post sample data tomorrow, but basically I have an array of words with their respective coordinates on the image (I could easily draw a rectangle around each of them, for example).
Comments (1)
Suppose your original text is in two columns like this:
From your description, it sounds like your OCR has given you the individual words and their bounding rectangles. If the original page was scanned squarely (without skew), then all of the words on a given line should have the same (or very close) y values. If they're not exactly the same, you can integer-divide the vertical positions by some fraction of a typical box height. That should cluster the y values. You can do similar processing on the x coordinates to ensure that words at the edge of a column also have identical x values.
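As a rough illustration of the integer-division idea, here is a minimal Python sketch. The word layout `(text, left, top, right, bottom)`, the sample coordinates, and the `fraction` parameter are all assumptions made for the example, not something your OCR is known to produce. It buckets the vertical centre of each box rather than the top edge, which is one possible choice.

```python
from statistics import median

# Hypothetical sample data: (text, left, top, right, bottom) in image pixels,
# with the origin at the top-left corner of the page.
words = [
    ("Hello", 10, 100, 60, 120),
    ("world", 70, 102, 120, 122),
    ("Next", 10, 130, 50, 150),
    ("line", 60, 132, 100, 152),
]

def line_keys(words, fraction=0.5):
    """Give each word an integer key; words on the same line should share a key."""
    # Use the median box height as the "typical" height.
    typical_height = median(bottom - top for _, _, top, _, bottom in words)
    # Bucket size is some fraction of that height (never smaller than one pixel).
    bucket = max(1.0, typical_height * fraction)
    # Integer-divide each vertical centre by the bucket size.
    return [int(((top + bottom) / 2) // bucket) for _, _, top, _, bottom in words]

keys = line_keys(words)
```

One caveat with plain bucketing: two words on the same line can still land in different buckets if their y values happen to straddle a bucket boundary, so you may want to merge adjacent keys or tune the fraction for your scans.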
To detect the separate columns, I'd try making a histogram of all the "left" values of all the words (or right edges if your text runs right-to-left). You should see a peak at the beginning of each column.
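A histogram like that could be sketched as follows in Python. The function name, the `bin_width` and `min_count` parameters, and the sample left edges are assumptions for illustration; sensible values depend on your resolution and on how many lines fall inside the rectangle.

```python
from collections import Counter

def column_start_candidates(lefts, bin_width=5, min_count=3):
    """Return x positions where at least min_count words begin.

    lefts is a list of left-edge x coordinates, one per word. Edges are
    grouped into bins of bin_width pixels so that small OCR jitter does
    not split a peak.
    """
    histogram = Counter(left // bin_width for left in lefts)
    return sorted(
        bin_index * bin_width
        for bin_index, count in histogram.items()
        if count >= min_count
    )

# Hypothetical left edges for three lines of a two-column layout:
# the first word of each line in a column starts near x=10 or x=300.
lefts = [10, 62, 118, 300, 355,
         10, 48, 130, 301, 340,
         11, 75, 302, 360, 410]

candidates = column_start_candidates(lefts, bin_width=5, min_count=3)
```

With this sample data the only bins that collect three or more left edges are the ones at 10 and 300, so those come back as candidate column starts. Words in the middle of a line have scattered left edges and stay below the threshold.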
You can probably rule out any false positives by ensuring that, on every line, there is a gap between the candidate start of a column and the right coordinate of the last box before it. The gap should probably be at least as wide as the narrowest word.
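That gap test might look something like this in Python. It assumes the words have already been grouped into lines, that each box is reduced to a `(left, right)` pair, and that `min_gap` has been set to the narrowest word width; the function name and sample boxes are made up for the example.

```python
def is_plausible_column_start(lines, candidate_x, min_gap):
    """Check that every line leaves a gap of at least min_gap before candidate_x.

    lines is a list of lines; each line is a list of (left, right) boxes.
    """
    for boxes in lines:
        # Right edges of all boxes that start before the candidate position.
        rights_before = [right for left, right in boxes if left < candidate_x]
        if rights_before and candidate_x - max(rights_before) < min_gap:
            return False  # a word runs up to (or across) the candidate start
    return True

# Hypothetical two-line, two-column sample.
lines = [
    [(10, 50), (60, 110), (300, 340), (350, 400)],
    [(10, 40), (48, 120), (300, 330), (338, 390)],
]
# Narrowest word width in the sample, used as the minimum gap.
min_gap = min(right - left for boxes in lines for left, right in boxes)

real_start = is_plausible_column_start(lines, 300, min_gap)
false_positive = is_plausible_column_start(lines, 60, min_gap)
```

Requiring the gap on every line is strict: a heading or table row that spans both columns would veto a genuine column start, so in practice you might accept a candidate when most lines pass.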
You can then partition your words up into column groups by checking which horizontal range their left and right coordinates fall into. In our example, the words from "Aaaa" through "lll" would end up in the first partition and the words from "mmmm" through "uu." would end up in the second partition.
Within each partition, you can then partition on line by sorting on the y coordinates. Finally, for each line, you sort on the x coordinate. (Whether you sort on ascending or descending depends on your coordinate system and the direction your text flows.)
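Putting the partitioning and the two sorts together, a minimal Python sketch could look like this. It assumes a top-left origin, left-to-right text, column starts that are already known, and a fixed `line_height` for bucketing; for brevity it assigns a word to a column by its left edge only. The words other than "Aaaa", "lll", "mmmm" and "uu." and all coordinates are invented for the example.

```python
from bisect import bisect_right

def reading_order(words, column_starts, line_height):
    """Return word texts ordered column by column, line by line, left to right.

    words: list of (text, left, top) tuples.
    column_starts: ascending x positions where each column begins.
    line_height: vertical bucket size used to group words into lines.
    """
    def sort_key(word):
        _, left, top = word
        # Index of the last column start at or before the word's left edge.
        column = max(0, bisect_right(column_starts, left) - 1)
        line = top // line_height
        return (column, line, left)

    return [text for text, _, _ in sorted(words, key=sort_key)]

# Hypothetical words in the row-by-row order a naive reader would produce.
words = [
    ("Aaaa", 10, 100), ("bbb", 60, 101), ("mmmm", 300, 100), ("nn", 350, 102),
    ("ccc", 10, 130), ("lll", 55, 131), ("tt", 300, 130), ("uu.", 340, 131),
]

ordered = reading_order(words, column_starts=[10, 300], line_height=30)
```

Sorting once on a `(column, line, x)` key is equivalent to partitioning by column, then by line, then ordering within the line. For right-to-left text or a bottom-left origin you would negate the relevant component of the key.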
The same basic idea could be applied to tables as well as columns of text, but you might need some tweaks to deal with things like right-justified cells.