OCR and distinguishing between 2 or 3 fonts

Posted 2024-11-27 02:28:19

Let's say that I have a black and white image of a document with only 2 or 3 fonts being used. One of the 3 is used for the title and another is a small font (or at least, very plain). For example, one of the little bits of text might be:

Fancy/Bolded/Italicized/Script font: The Best Soup In The World
Plain/small: Made with tap water, salt, and sugar.

Fancy/Bolded/Italicized/Script font: The Best Soup and 1/2 Sandwich In The World
Plain/small: Made with flour, tap water, salt, and sugar.

I don't need a big fancy OCR system that can tell me that "Best Soup" uses a particular fancy font with italics/etc. I just need a system that can tell me "Best Soup" is formatted rather differently from "tap water", that "Best Soup" and "Sandwich" are probably using the same formatting, and "Sandwich" is bigger/fancier than "tap water."

I'll be using Tesseract to do the actual OCR and bounding box detection (http://www.mail-archive.com/[email protected]/msg02157.html), if that's relevant.
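One inexpensive direction, sketched here under assumptions rather than taken from any particular library: once the word bounding boxes are available, compare text lines by their median word-box height and split them into "large" and "small" at the widest gap between observed heights. The fraction of black pixels inside a box can serve as a rough extra signal for bold or heavy fonts. The function names, the `(left, top, right, bottom)` box layout, and the 0/1 image representation below are all hypothetical choices for illustration, not Tesseract's output format.

```python
from statistics import median


def ink_density(image, box):
    """Fraction of black pixels inside box.

    image: list of rows, each a list of 0 (white) or 1 (black).
    box: (left, top, right, bottom), right and bottom exclusive (assumed layout).
    """
    left, top, right, bottom = box
    area = (right - left) * (bottom - top)
    if area <= 0:
        return 0.0
    black = sum(sum(row[left:right]) for row in image[top:bottom])
    return black / area


def split_by_largest_gap(values):
    """Return a threshold in the middle of the widest gap between sorted values."""
    ordered = sorted(set(values))
    if len(ordered) < 2:
        return None  # only one distinct value, nothing to split
    gaps = [(b - a, (a + b) / 2) for a, b in zip(ordered, ordered[1:])]
    return max(gaps)[1]


def classify_lines(lines):
    """Label each text line 'large' or 'small' by median word-box height.

    lines: one list of word boxes per text line.
    """
    heights = [median(b[3] - b[1] for b in boxes) for boxes in lines]
    threshold = split_by_largest_gap(heights)
    if threshold is None:
        return ["small"] * len(lines)
    return ["large" if h > threshold else "small" for h in heights]
```

Using the per-line median rather than individual word heights is a deliberate choice: a single word's box height depends on whether it has ascenders or descenders, so "tap" and "salt" can differ even in the same font. This only separates two size groups; script or italic styles of similar size would need further features.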

Is there anything out there that I can use to do this simple formatting classification?

Edit:

Is there anything out there that will do this without costing me an arm and a leg?

冷情 2024-12-04 02:28:19

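I'm not sure whether Tesseract can handle the task you describe, but I believe a good OCR engine should detect font styles. For example, the ABBYY OCR SDK not only recognizes bold/italic font styles, it also identifies a suitable font to use in the output.

From your description, I guess you are trying to determine the document's style hierarchy, such as heading levels. ABBYY FineReader Engine provides this feature, so you would not need to write your own routines that infer the purpose of text from font size and style. In addition, it provides the best OCR quality and is available as a free trial. If you plan to use commercial software, consider giving it a try. I work at ABBYY and can provide more information about our OCR SDK if needed.

Best regards.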
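For comparison, a hand-written routine of the kind mentioned above (inferring a heading level from text size) might look like the following minimal sketch. It is not how ABBYY FineReader Engine works; the function name `assign_levels`, the 15% relative tolerance, and the use of line heights in pixels as a stand-in for font size are all assumptions made for illustration.

```python
def assign_levels(line_heights, tolerance=0.15):
    """Map each line height to a hierarchy level (1 = largest size class).

    Heights within `tolerance` (relative) below a class's largest member
    share that class. Heights are assumed to be positive numbers.
    """
    representatives = []  # largest height of each size class, descending
    for h in sorted(set(line_heights), reverse=True):
        last = representatives[-1] if representatives else None
        if last is None or (last - h) / last > tolerance:
            representatives.append(h)

    def level_of(h):
        # The first class whose representative is within tolerance of h.
        for index, rep in enumerate(representatives):
            if h <= rep and (rep - h) / rep <= tolerance:
                return index + 1
        return len(representatives)

    return [level_of(h) for h in line_heights]
```

Called as `assign_levels([30, 12, 29, 11.5, 18])`, this groups 30 and 29 into one class, 18 into a second, and 12 and 11.5 into a third. Size alone will not separate a script title from plain text of similar height, so a real routine would likely combine it with style cues.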
