Can OCR software reliably read values from a table?

Posted 2024-11-10 17:16:37

Would OCR software be able to reliably translate an image such as the following into a list of values?
Table of values

UPDATE:

In more detail the task is as follows:

We have a client application in which the user can open a report. This report contains a table of values. But not every report looks the same: different fonts, different spacing, different colors, and a report may contain several tables with different numbers of rows and columns.

The user selects an area of the report that contains a table, using the mouse.

Now we want to convert the selected table into values using our OCR tool.

At the time the user selects the rectangular area, I can ask for extra information to help with the OCR process, and ask for confirmation that the values have been correctly recognised.

It will initially be an experimental project, so it will most likely use an open-source OCR tool, or at least one that does not cost any money for experimental purposes.

Comments (8)

弃爱 2024-11-17 17:16:37

The simple answer is yes; you just need to choose the right tools.

I don't know whether open-source software can get close to 100% accuracy on those images, but based on the answers here it probably can, if you spend some time on training and solve the table analysis problem and similar issues.

A commercial OCR product such as ABBYY will give you 99%+ accuracy out of the box and will detect tables automatically. No training, no setup, it just works. The drawback is that you have to pay for it. Some would object that with open source you pay with the time you spend setting it up and maintaining it, but everyone has to decide that for themselves.

If we are talking about commercial tools, though, there is actually more to choose from, and it depends on what you want. Boxed products like FineReader are aimed at converting input documents into editable documents such as Word or Excel files. Since you actually want the data, not a Word document, you may need to look at a different product category: Data Capture, which is essentially OCR plus some additional logic for finding the required data on the page. For an invoice that could be the company name, total amount, due date, line items in the table, and so on.

Data Capture is a complicated subject and takes some learning, but used properly it can give guaranteed accuracy when capturing data from documents. It uses various rules for data cross-checks, database lookups, etc., and when necessary it can send data for manual verification. Enterprises widely use Data Capture applications to enter millions of documents every month and rely heavily on the extracted data in their everyday workflow.

There are also OCR SDKs, of course, which give you API access to the recognition results so that you can program what to do with the data.

If you describe your task in more detail, I can advise you on which direction is easiest to take.

UPDATE

So what you are building is basically a Data Capture application, though not a fully automated one, using the so-called "click to index" approach. There are a number of applications like that on the market: you scan images, an operator clicks on text in the image (or draws a rectangle around it), and the fields are then populated into a database. It is a good approach when the number of images to process is relatively small and the manual workload is not big enough to justify the cost of a fully automated application (yes, there are fully automated systems that can handle images with different fonts, spacing, layouts, numbers of rows in the tables, and so on).

If you have decided to develop this yourself instead of buying, then all you need to do is choose an OCR SDK. You are going to write all of the UI yourself, right? The big decision is open source or commercial.

The best open-source option is Tesseract OCR, as far as I know. It is free, but it may have real problems with table analysis; with a manual zoning approach, though, that should not be an issue. As for OCR accuracy: people often train the OCR on a particular font to increase accuracy, but that will not work in your case, since the fonts may differ. So you can just try Tesseract and see what accuracy you get; this will determine the amount of manual work needed to correct the results.
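
One rough way to put a number on "see what accuracy you get" is to type in the correct values for a few sample tables by hand and compare them with the OCR output character by character. The sketch below uses only Python's standard `difflib`; the function name `character_accuracy` is made up for illustration, and the score is an approximation rather than a standard OCR metric.

```python
from difflib import SequenceMatcher


def character_accuracy(expected, recognized):
    """Fraction of the expected characters that also appear, in order,
    in the recognized text (1.0 means every expected character was found)."""
    if not expected:
        return 1.0 if not recognized else 0.0
    # autojunk=False keeps difflib from ignoring frequent characters
    # such as digits in long numeric strings.
    matcher = SequenceMatcher(None, expected, recognized, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(expected)


if __name__ == "__main__":
    # Hand-typed ground truth versus a hypothetical OCR result.
    print(character_accuracy("1234.50", "1284.50"))
```

Averaging this score over every cell the user confirms or corrects would give a running estimate of how much manual correction a given OCR engine needs.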

Commercial OCR will give higher accuracy but will cost you money. I think you should take a look anyway to see whether it is worth it or whether Tesseract is good enough for you. The simplest way would be to download a trial version of a boxed OCR product such as FineReader. That will give you a good idea of the accuracy you would get from an OCR SDK.

笔落惊风雨 2024-11-17 17:16:37

If you always have solid borders in your table, you can try this solution:

  1. Locate the horizontal and vertical lines on each page (long runs of
    black pixels)
  2. Segment the image into cells using the line coordinates
  3. Clean up each cell (remove borders, threshold to black and white)
  4. Perform OCR on each cell
  5. Assemble results into a 2D array
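
Steps 1 and 2 above can be sketched without any imaging library, assuming the page has already been thresholded into a list of rows where 1 is a black pixel and 0 is white. The function names and the `min_fraction` parameter are hypothetical, and a real implementation would more likely use OpenCV; the OCR call for each cell (step 4) is left out.

```python
def find_lines(binary, axis, min_fraction=0.8):
    """Return indices of rows (axis=0) or columns (axis=1) that are mostly
    black, i.e. candidates for table border lines."""
    height, width = len(binary), len(binary[0])
    if axis == 0:
        counts = [sum(row) for row in binary]
        length = width
    else:
        counts = [sum(binary[y][x] for y in range(height)) for x in range(width)]
        length = height
    return [i for i, count in enumerate(counts) if count >= min_fraction * length]


def group_runs(indices):
    """Collapse consecutive indices (one thick line) into (start, end) pairs."""
    runs = []
    for i in indices:
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)
        else:
            runs.append((i, i))
    return runs


def cell_boxes(binary):
    """Return a 2D list of (top, bottom, left, right) boxes, one per cell.

    Boxes exclude the border pixels; bottom and right are exclusive bounds,
    so a cell can be cropped with binary[top:bottom] and row[left:right].
    """
    h_lines = group_runs(find_lines(binary, axis=0))
    v_lines = group_runs(find_lines(binary, axis=1))
    grid = []
    for (_, top_end), (bottom_start, _) in zip(h_lines, h_lines[1:]):
        row = []
        for (_, left_end), (right_start, _) in zip(v_lines, v_lines[1:]):
            row.append((top_end + 1, bottom_start, left_end + 1, right_start))
        grid.append(row)
    return grid
```

Each box can then be cropped, cleaned, and passed to the OCR engine, and the results stored at the same row and column position to get the 2D array from step 5.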

Otherwise, if your document has a borderless table, you can try the following approach:

Optical Character Recognition is pretty amazing stuff, but it isn’t
always perfect. To get the best possible results, it helps to use the
cleanest input you can. In my initial experiments, I found that
performing OCR on the entire document actually worked pretty well as
long as I removed the cell borders (long horizontal and vertical
lines). However, the software compressed all whitespace into a single
empty space. Since my input documents had multiple columns with
several words in each column, the cell boundaries were getting lost.
Retaining the relationship between cells was very important, so one
possible solution was to draw a unique character, like “^” on each
cell boundary – something the OCR would still recognize and that I
could use later to split the resulting strings.
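
The splitting step described in that quote might look like the sketch below. It assumes the marker character was drawn only on the boundaries between cells (not at the start or end of a row) and that the OCR engine returns one table row per text line; `split_marked_rows` is a hypothetical name, not something from the quoted article.

```python
def split_marked_rows(ocr_text, marker="^"):
    """Split OCR output into rows of cells using the marker character that
    was drawn on each cell boundary before running OCR."""
    rows = []
    for line in ocr_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between rows
        rows.append([cell.strip() for cell in line.split(marker)])
    return rows


if __name__ == "__main__":
    sample = "Name ^ Unit price ^ Qty\nWidget A ^ 3.50 ^ 12\n"
    print(split_marked_rows(sample))
```

The marker has to be a character that never occurs in the table data itself, otherwise cells will be split in the wrong places.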

I found all of this in the linked article, which turned up when I searched Google for "OCR to table". The author published a full algorithm using Python and Tesseract, both open-source solutions!

If you want to see what Tesseract can do, you could try this site:

http://www.free-ocr.com/

dawn曙光 2024-11-17 17:16:37

Which OCR are you talking about?
Will you be developing code on top of that OCR, or will you be using something off the shelf?

FYI:
Tesseract OCR

It ships with a document-reading executable, so you can feed in a whole page and it will extract the characters for you. It recognizes blank spaces pretty well, which might help with tab spacing.

余生再见 2024-11-17 17:16:37

I've been OCR'ing scanned documents since '98. This is a recurring problem with scanned docs, especially those that include rotated and/or skewed pages.

Yes, there are several good commercial systems, and some of them can, once well configured, deliver a terrific automatic data-mining rate, asking for the operator's help only with very degraded fields. If I were you, I'd rely on one of them.

If the commercial options threaten your budget, OSS can lend a hand. But "there's no free lunch", so you'll have to rely on a bunch of tailor-made scripts to scaffold an affordable solution for processing your docs. Fortunately, you are not alone: over the past decades, many people have been dealing with this. IMHO, the best and most concise answer to this question is provided by this article:

https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/

It is worth reading! The author offers useful tools of his own, but the article's conclusion is what matters most for giving you a good mindset about how to solve this kind of problem.

"There is no silver bullet."
(Fred Brooks, The Mythical Man-Month)

沉默的熊 2024-11-17 17:16:37

It really depends on the implementation.

There are a few parameters that affect an OCR engine's ability to recognize text:
1. How well the OCR is trained: the size and quality of the examples database.
2. How well it is trained to detect "garbage" (besides knowing what a letter is, it needs to know what is NOT a letter).
3. The OCR's design and type.
4. If it's a neural network, the network's structure affects its ability to learn and "decide".

So, if you're not building one of your own, it's just a matter of testing different kinds until you find one that fits.

誰認得朕 2024-11-17 17:16:37

You could try another approach. With Tesseract (or other OCR engines) you can get the coordinates of each word. You can then group those words by their vertical and horizontal coordinates to get rows and columns, for example to tell the difference between an ordinary space and a tab-sized gap. It takes some practice to get good results, but it is possible. With this method you can detect tables even when they use invisible separators, i.e. no lines. The word coordinates are a solid basis for table recognition.
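
A minimal sketch of that grouping, assuming each word arrives as a dict with `text`, `left`, `top` and `width` keys in pixels (the same column names Tesseract's TSV output uses, but treat the exact input shape as an assumption). The function names and both thresholds are hypothetical and would need tuning per report.

```python
def group_into_rows(words, y_tolerance=5):
    """Group words whose top coordinates are within y_tolerance pixels."""
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["left"])):
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return rows


def split_into_cells(row, max_word_gap=15):
    """Join words separated by a small gap (an ordinary space) and start a
    new cell when the gap is larger (a column break)."""
    cells = []
    prev_right = None
    for word in sorted(row, key=lambda w: w["left"]):
        if prev_right is not None and word["left"] - prev_right <= max_word_gap:
            cells[-1] += " " + word["text"]
        else:
            cells.append(word["text"])
        prev_right = word["left"] + word["width"]
    return cells


def words_to_table(words, y_tolerance=5, max_word_gap=15):
    """Turn a flat list of positioned words into a list of rows of cells."""
    rows = group_into_rows(words, y_tolerance)
    return [split_into_cells(row, max_word_gap) for row in rows]
```

This sketch does not align columns across rows, so an empty cell in the middle of a row would shift the following values left; matching each cell's left edge against column positions collected from all rows would be the next refinement.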

淡紫姑娘! 2024-11-17 17:16:37

We have also struggled with the issue of recognizing text within tables. There are two solutions that do it out of the box: ABBYY Recognition Server and ABBYY FlexiCapture. Recognition Server is a server-based, high-volume OCR tool designed for converting large volumes of documents to a searchable format. Although it is available with an API for those types of uses, we recommend FlexiCapture. FlexiCapture gives low-level control over the extraction of data from table formats, including automatic detection of table items on a page. It is available as a full API version without a front end, or as the off-the-shelf version that we market. Reach out to me if you want to know more.

小…红帽 2024-11-17 17:16:37

Here are the basic steps that have worked for me. Tools needed include Tesseract, Python, OpenCV, and ImageMagick if you need to do any rotation of images to correct skew.

  1. Use Tesseract to detect rotation and ImageMagick mogrify to fix it.
  2. Use OpenCV to find and extract tables.
  3. Use OpenCV to find and extract each cell from the table.
  4. Use OpenCV to crop and clean up each cell so that there is no noise that will confuse OCR software.
  5. Use Tesseract to OCR each cell.
  6. Combine the extracted text of each cell into the format you need.

The code for each of these steps is extensive, but if you want to use a Python package, it's as simple as the following.

pip3 install table_ocr
python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

That package and demo module will turn the following table into CSV output.

example table with 3 columns and 6 rows

Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4
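
For step 6, if you are assembling the cell text yourself rather than using the package, Python's standard `csv` module handles the quoting seen above, where the formula containing commas is wrapped in double quotes. This is only a sketch of that final step; `cells_to_csv` is a made-up helper name, not part of `table_ocr`.

```python
import csv
import io


def cells_to_csv(rows):
    """Serialize a 2D list of cell strings as CSV text.

    The default csv dialect quotes only the fields that need it, such as
    values that contain commas.
    """
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    table = [
        ["Cell", "Format", "Formula"],
        ["E4", "Currency", "=PMT(B4/12,C4,D4)"],
        ["F4", "Currency", "=E4*C4"],
    ]
    print(cells_to_csv(table), end="")
```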

If you need to make any changes to get the code to work for table borders with different widths, there are extensive notes at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html
