OCR and distinguishing between 2 or 3 fonts

Posted on 2024-11-27 02:28:19

Let's say that I have a black-and-white image of a document that uses only 2 or 3 fonts. One of them is used for the title, and another is a small font (or at least a very plain one). For example, one small snippet of text might be:

Fancy/Bolded/Italicized/Script font: The Best Soup In The World
Plain/small: Made with tap water, salt, and sugar.

Fancy/Bolded/Italicized/Script font: The Best Soup and 1/2 Sandwich In The World
Plain/small: Made with flour, tap water, salt, and sugar.

I don't need a big fancy OCR system that can tell me that "Best Soup" uses a particular fancy font with italics, etc. I just need a system that can tell me that "Best Soup" is formatted rather differently from "tap water", that "Best Soup" and "Sandwich" are probably using the same formatting, and that "Sandwich" is bigger/fancier than "tap water".

I'll be using Tesseract to do the actual OCR and bounding box detection (http://www.mail-archive.com/[email protected]/msg02157.html), if that's relevant.
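
To make the goal concrete, here is a minimal sketch of the kind of classification meant above, using only bounding-box height. It assumes the OCR step has already produced one `(text, height_in_pixels)` pair per text line; that input format, the function name `split_by_height`, and the pixel heights in the usage note are all hypothetical and are not part of Tesseract's API. It only separates "bigger" from "smaller" text and says nothing about bold, italic, or script styling.

```python
from typing import List, Tuple

# Hypothetical input: one (text, bounding-box height in pixels) pair per line.
Line = Tuple[str, int]


def split_by_height(lines: List[Line]) -> Tuple[List[str], List[str]]:
    """Split lines into (larger, smaller) groups at the widest gap in heights."""
    heights = sorted({h for _, h in lines})
    if len(heights) < 2:
        # Only one distinct height: nothing to separate, treat all as "smaller".
        return [], [t for t, _ in lines]
    # Widest gap between consecutive distinct heights; threshold is its midpoint.
    _, threshold = max(
        (b - a, (a + b) / 2) for a, b in zip(heights, heights[1:])
    )
    larger = [t for t, h in lines if h > threshold]
    smaller = [t for t, h in lines if h <= threshold]
    return larger, smaller
```

With made-up heights such as 42 and 44 pixels for the two title lines and 18 and 19 pixels for the two ingredient lines, the widest gap falls between 19 and 42, so the titles land in the first group and the ingredient lines in the second. Line-level boxes are used rather than word-level ones because a single word's box height depends on whether it contains ascenders or descenders.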

Is there anything out there that I can use to do this simple formatting classification?

Edit:

Is there anything out there that will do this without costing me an arm and a leg?

Comments (1)

冷情 2024-12-04 02:28:19

I'm not sure whether Tesseract can solve the task you describe, but I believe a good OCR engine should detect font styles. For example, ABBYY OCR SDK can not only identify bold/italic font styles, but can also determine an appropriate font face to use in the output.

Based on what you describe, I guess you are trying to determine the document's style hierarchy, such as header levels. ABBYY FineReader Engine provides this functionality, so you don't have to write your own routine for inferring the purpose of text from its font size and style. Besides, it provides the best OCR quality and it's free to try. Consider trying it out if you plan to use commercial software. I work at ABBYY and can provide more info about our OCR SDK if necessary.

Best regards.
