Can OCR software reliably read values from a table?

Posted on 2024-11-10 17:16:37

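Can OCR software reliably convert an image like the following into a list of values?

Update:

The task in more detail:

We have a client application in which the user can open a report. The report contains a table of values. But not every report looks the same: different fonts, different spacing, different colors, and a report may contain several tables with different numbers of rows and columns.

The user selects the area of the report that contains a table, using the mouse.

We then want to convert the selected table into values using our OCR tool.

While the user is selecting the rectangular area, I can ask for extra information to help the OCR process, and I can ask the user to confirm that the values were recognized correctly.

This will start out as an experimental project, so it will most likely use an open source OCR tool, or at least one that costs nothing for experimental use.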

弃爱 2024-11-17 17:16:37

The simple answer is yes; you just need to choose the right tools.

I don't know whether open source can ever get close to 100% accuracy on those images, but based on the answers here it probably can, if you spend some time on training and solve the table analysis problem and similar issues.

Commercial OCR such as ABBYY will give you 99%+ accuracy out of the box and will detect tables automatically. No training, no setup, it just works. The drawback is that you have to pay for it. Some would object that with open source you pay with the time spent setting it up and maintaining it, but everyone decides that for themselves.

However, among commercial tools there is actually more choice, and it depends on what you want. Boxed products like FineReader are aimed at converting input documents into editable documents such as Word or Excel. Since you want the data itself, not a Word document, you may need to look at a different product category: Data Capture, which is essentially OCR plus additional logic to find the necessary data on the page. For an invoice that could be the company name, total amount, due date, line items in the table, and so on.

Data Capture is a complicated subject and takes some learning, but used properly it can give guaranteed accuracy when capturing data from documents. It applies rules for cross-checking data, database lookups, and so on, and when necessary it can send data for manual verification. Enterprises widely use Data Capture applications to enter millions of documents every month and rely heavily on the extracted data in their everyday workflows.
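To make the idea of a cross-check rule concrete, here is a minimal sketch in Python. It assumes one illustrative rule (the line items of an invoice must add up to the stated total); the function names `parse_amount` and `needs_manual_review` are hypothetical and do not come from any Data Capture product.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Parse an OCR'd amount such as '1,250.00'; return None if it is not a number."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def needs_manual_review(line_item_texts, total_text):
    """Return True when the extracted values fail the (assumed) sum rule."""
    amounts = [parse_amount(t) for t in line_item_texts]
    total = parse_amount(total_text)
    if total is None or any(a is None for a in amounts):
        return True  # at least one field did not even parse as a number
    return sum(amounts) != total


# A letter "O" misread in place of a zero is caught because the field no longer parses.
print(needs_manual_review(["100.00", "25O.50"], "350.50"))
```

A document that fails the rule would be routed to an operator instead of being stored directly.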

And of course there are also OCR SDKs, which give you API access to the recognition results so that you can program what to do with the data.

If you describe your task in more detail, I can advise which direction would be easier to take.

UPDATE

So what you are building is basically a Data Capture application, but not a fully automated one, using the so-called "click to index" approach. There are a number of applications like that on the market: you scan images, the operator clicks on text in the image (or draws a rectangle around it), and the fields are then populated into a database. It is a good approach when the number of images to process is relatively small and the manual workload is not big enough to justify the cost of a fully automated application (yes, there are fully automated systems that can handle images with different fonts, spacing, layouts, numbers of table rows, and so on).

If you have decided to develop this yourself instead of buying, then all you need is to choose an OCR SDK. You are going to write all of the UI yourself, right? The big decision is open source or commercial.

The best open source option is Tesseract OCR, as far as I know. It is free, and although it may have real problems with table analysis, that should not matter with a manual zoning approach. As for OCR accuracy, people often train the OCR on a specific font to increase accuracy, but that does not apply to you, since your fonts can differ. So you can just try Tesseract and see what accuracy you get; that will determine how much manual correction is needed.
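One way to estimate the manual correction workload is to look at the per-word confidence that Tesseract reports in its TSV output (for example from `tesseract region.png stdout --psm 6 tsv`). The sketch below only parses such output; the sample text is made up for illustration, and the threshold of 80 and the function name `low_confidence_words` are assumptions, not recommended values.

```python
import csv
import io


def low_confidence_words(tsv_text, threshold=80.0):
    """Return (text, conf) for recognized words whose confidence is below threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    flagged = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if not text or not row.get("conf"):
            continue  # structural rows (blocks, lines) carry no word text
        conf = float(row["conf"])
        if 0 <= conf < threshold:
            flagged.append((text, conf))
    return flagged


# Made-up sample in Tesseract's TSV column layout.
HEADER = ("level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
          "left\ttop\twidth\theight\tconf\ttext")
SAMPLE_TSV = "\n".join([
    HEADER,
    "4\t1\t1\t1\t1\t0\t10\t10\t200\t20\t-1\t",
    "5\t1\t1\t1\t1\t1\t10\t10\t60\t20\t96.1\t12.50",
    "5\t1\t1\t1\t1\t2\t120\t10\t60\t20\t41.7\t3B.00",
])

print(low_confidence_words(SAMPLE_TSV))
```

Only the flagged words would need to be shown to the user for confirmation, which fits the confirm-the-values step described in the question.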

Commercial OCR will give higher accuracy but will cost you money. I think you should look at it anyway to see whether it is worth it or whether Tesseract is good enough for you. The simplest way is probably to download a trial version of a boxed OCR product such as FineReader. That will give you a good idea of the accuracy you would get from the corresponding OCR SDK.

笔落惊风雨 2024-11-17 17:16:37

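If your tables always have solid borders, you can try this solution:

1. Find the horizontal and vertical lines on each page (long runs of black pixels).
2. Split the image into cells using the line coordinates.
3. Clean up each cell (remove the borders, threshold to black and white).
4. Run OCR on each cell.
5. Assemble the results into a 2D array.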
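The line-finding and cell-splitting steps can be sketched in plain Python. This is an illustration under simplifying assumptions: the image is already thresholded into a list of rows of 0/1 values (1 = black), the table is axis-aligned, and a row or column counts as a border when it contains a black run covering at least `min_fraction` of the image. A real implementation would load and threshold the image with an imaging library; all function names here are hypothetical.

```python
def longest_run(values):
    """Length of the longest run of consecutive black (1) pixels."""
    best = current = 0
    for v in values:
        current = current + 1 if v else 0
        best = max(best, current)
    return best


def line_positions(lines, min_run):
    """Indices of rows (or columns) that contain a black run of at least min_run."""
    return [i for i, values in enumerate(lines) if longest_run(values) >= min_run]


def gaps_between(positions):
    """(start, end) spans between adjacent lines; thick borders leave no gap."""
    return [(a + 1, b) for a, b in zip(positions, positions[1:]) if b - a > 1]


def split_cells(image, min_fraction=0.8):
    """Return a 2D list of (top, bottom, left, right) cell boxes, end-exclusive."""
    height, width = len(image), len(image[0])
    rows = line_positions(image, min_fraction * width)
    cols = line_positions(list(zip(*image)), min_fraction * height)
    return [[(top, bottom, left, right) for left, right in gaps_between(cols)]
            for top, bottom in gaps_between(rows)]


# A 7x7 test image with border lines at rows and columns 0, 3 and 6.
grid = [[1 if (r in (0, 3, 6) or c in (0, 3, 6)) else 0 for c in range(7)]
        for r in range(7)]
grid[1][1] = 1  # one "text" pixel inside the top-left cell

for row in split_cells(grid):
    print(row)
```

Each returned box would then be cropped, cleaned, and passed to the OCR engine, and the recognized strings placed at the same row and column index.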

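If instead your documents have borderless tables, you can try following this line of thought:

"Optical Character Recognition is pretty amazing stuff, but it isn't always perfect. To get the best possible results, it helps to use the cleanest input you can. In my initial experiments, I found that performing OCR on the entire document actually worked pretty well as long as I removed the cell borders (long horizontal and vertical lines). However, the software compressed all whitespace into a single empty space. Since my input documents had multiple columns with several words in each column, the cell boundaries were getting lost. Retaining the relationship between cells was very important, so one possible solution was to draw a unique character, like "^", on each cell boundary: something the OCR would still recognize and that I could use later to split the resulting strings."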
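The splitting step at the end of that quote can be sketched as follows. It assumes the marker was drawn only on the boundaries between cells (not at the outer edges of the table) and that the marker character does not occur in the table's own text; `split_marked_lines` is a hypothetical name.

```python
def split_marked_lines(ocr_text, marker="^"):
    """Split OCR output into a 2D list of cell strings using the drawn marker."""
    table = []
    for line in ocr_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between table rows
        table.append([cell.strip() for cell in line.split(marker)])
    return table


ocr_output = "Item ^ Qty ^ Price\nWidget A ^ 2 ^ 9.99\n"
for row in split_marked_lines(ocr_output):
    print(row)
```

Because the marker survives OCR, multi-word cells such as "Widget A" stay together even though the engine collapsed the original column spacing.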

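I found all of this in a post that came up when I searched Google for "OCR to table". The author published a complete algorithm using Python and Tesseract, both of which are open source.

If you want to try out what Tesseract can do, you could try this site:

http://www.free-ocr.com/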

dawn曙光 2024-11-17 17:16:37

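Which OCR are you talking about?
Will you be developing code on top of that OCR, or using something off the shelf?

FYI:
Tesseract OCR

It ships with a ready-made executable for reading documents, so you can feed in a whole page and it will extract the characters for you. It recognizes blank space quite well, which may help with tab spacing.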

余生再见 2024-11-17 17:16:37

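I have been OCR'ing scanned documents since '98. This is a recurring problem with scanned documents, especially those that include rotated and/or skewed pages.

Yes, there are several good commercial systems, and some of them, once well configured, can deliver an excellent automatic data-mining rate, asking for the operator's help only on badly degraded fields. If I were you, I would rely on one of them.

If the commercial options strain your budget, OSS can lend a hand. But there is no free lunch: you will have to rely on a collection of tailor-made scripts to scaffold an affordable solution for processing your pile of documents. Fortunately, you are not alone. Many people have been dealing with this over the past decades. So, IMHO, the best and most concise answer to this question is given by this article: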

https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/

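It is well worth reading. The author offers useful tools of his own, but the article's conclusion matters most, because it gives you the right mindset for solving this kind of problem.

"There is no silver bullet."
(Fred Brooks, The Mythical Man-Month)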

沉默的熊 2024-11-17 17:16:37

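It really depends on the implementation.

Several parameters affect an OCR engine's ability to recognize text:
1. How well the OCR is trained: the size and quality of the example database.
2. How well it is trained to detect "garbage" (besides knowing what is a letter, you also need to know what is NOT a letter).
3. The design and type of the OCR.
4. If it is a neural network, the network's structure affects its ability to learn and "decide".

So, if you are not going to build one yourself, just test the different kinds available until you find one that fits.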
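Testing several engines is easier with a number to compare. A common choice is the character error rate: the edit distance between the recognized text and a hand-typed ground truth, divided by the length of the ground truth. The sketch below is one way to compute it; the engine names and outputs are invented for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(recognized, truth):
    """Edit distance normalized by the length of the ground truth."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(recognized, truth) / len(truth)


truth = "123.45"
outputs = {"engine_a": "1Z3.45", "engine_b": "123.45"}  # hypothetical results
best = min(outputs, key=lambda name: character_error_rate(outputs[name], truth))
print(best)
```

Averaging this rate over a set of representative report screenshots gives a fairer comparison than judging from a single image.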

誰認得朕 2024-11-17 17:16:37

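You could try another approach. With Tesseract (or other OCR engines) you can get the coordinates of each word. You can then group the words by their vertical and horizontal coordinates to get rows and columns, for example by telling an ordinary space apart from a tab-sized gap. It takes some practice to get good results, but it is possible. With this method you can detect a table even when it uses invisible separators (no lines). Word coordinates are a solid basis for table recognition.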
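A minimal sketch of that grouping, assuming each word arrives as a `(text, left, top, width, height)` tuple in pixels. The tolerance values and function names are illustrative assumptions and would need tuning for each font size.

```python
def group_into_rows(words, row_tolerance=10):
    """Group word boxes whose vertical centers are close into rows, top to bottom."""
    rows = []
    for word in sorted(words, key=lambda w: w[2] + w[4] / 2):
        center = word[2] + word[4] / 2
        if rows and abs(center - rows[-1]["center"]) <= row_tolerance:
            rows[-1]["words"].append(word)
        else:
            rows.append({"center": center, "words": [word]})
    # Within each row, order the words left to right.
    return [sorted(r["words"], key=lambda w: w[1]) for r in rows]


def split_into_cells(row, gap_threshold=30):
    """Start a new cell whenever the horizontal gap exceeds gap_threshold."""
    cells = [[row[0]]]
    for prev, word in zip(row, row[1:]):
        gap = word[1] - (prev[1] + prev[3])  # left edge minus previous right edge
        if gap > gap_threshold:
            cells.append([word])
        else:
            cells[-1].append(word)
    return [" ".join(w[0] for w in cell) for cell in cells]


def words_to_table(words, row_tolerance=10, gap_threshold=30):
    """Convert word boxes into a list of rows, each a list of cell strings."""
    return [split_into_cells(row, gap_threshold)
            for row in group_into_rows(words, row_tolerance)]


words = [
    ("Widget", 10, 52, 60, 18), ("A", 75, 50, 12, 20), ("9.99", 200, 51, 40, 18),
    ("Item", 10, 10, 40, 18), ("Price", 200, 11, 50, 18),
]
for row in words_to_table(words):
    print(row)
```

This version splits each row independently, so an empty cell shifts later cells to the left; aligning cells to shared column positions across rows would be the next refinement.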

淡紫姑娘！ 2024-11-17 17:16:37

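We have also been working on the problem of recognizing text inside tables. Two products do this out of the box: ABBYY Recognition Server and ABBYY FlexiCapture. Rec Server is a server-based, high-volume OCR tool designed for converting large volumes of documents to a searchable format. Although it has an API for these kinds of uses, we recommend FlexiCapture. FlexiCapture gives low-level control over extracting data from table formats, including automatic detection of table items on a page. It is available as a full API version without a front end, or as the off-the-shelf version that we sell. Contact me if you would like to know more.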

小…红帽 2024-11-17 17:16:37

Here are the basic steps that have worked for me. Tools needed include Tesseract, Python, OpenCV, and ImageMagick if you need to do any rotation of images to correct skew.

1. Use Tesseract to detect rotation and ImageMagick mogrify to fix it.
2. Use OpenCV to find and extract tables.
3. Use OpenCV to find and extract each cell from the table.
4. Use OpenCV to crop and clean up each cell so that there is no noise that will confuse OCR software.
5. Use Tesseract to OCR each cell.
6. Combine the extracted text of each cell into the format you need.

The code for each of these steps is extensive, but if you want to use a python package, it's as simple as the following.

pip3 install table_ocr
python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

That package and demo module will turn the following table into CSV output.

Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4
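The final "combine" step can be illustrated with Python's standard `csv` module, which takes care of quoting cells that contain commas, such as the formula in the output above. This is a generic sketch, not the code of the `table_ocr` package; `cells_to_csv` is a hypothetical helper.

```python
import csv
import io


def cells_to_csv(rows):
    """Serialize a 2D list of cell strings to CSV text, quoting where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    ["Cell", "Format", "Formula"],
    ["E4", "Currency", "=PMT(B4/12,C4,D4)"],
    ["F4", "Currency", "=E4*C4"],
]
print(cells_to_csv(rows), end="")
```

Writing the cells through a CSV writer rather than joining them with commas by hand keeps a cell such as `=PMT(B4/12,C4,D4)` from being split into several columns when the file is read back.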

If you need to make any changes to get the code to work for table borders with different widths, there are extensive notes at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html
