How does Google Books find text regions?
One challenging topic in computer vision is processing document scans. Typically this involves a number of steps, like noise removal, color analysis, binarization, text block identification, OCR, and then maybe some context analysis and correction.
I'm curious if anyone understands, knows or can point me to literature on how Google identifies text blocks prior to the OCR stage. Any insights?
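To make two of the steps named above concrete (binarization and text block identification), here is a minimal sketch using a fixed threshold and projection profiles. It is only an illustration of one classic technique, not a description of what Google does. The function names (`binarize`, `runs`, `find_text_blocks`), the threshold of 128, and the gap parameters are all assumptions chosen for the example.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows: 0 = black, 255 = white
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Fixed-threshold binarization: 1 marks ink (dark), 0 marks background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def runs(profile: List[int], max_gap: int) -> List[Tuple[int, int]]:
    """Return (start, end) ranges where profile > 0, end exclusive.

    Ranges separated by at most max_gap empty positions are merged.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for s, e in spans:
        if merged and s - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def find_text_blocks(ink: Image, row_gap: int = 1, col_gap: int = 1) -> List[Box]:
    """Split the page into horizontal bands, then split each band into columns."""
    boxes: List[Box] = []
    row_profile = [sum(row) for row in ink]  # ink pixels per row
    for top, bottom in runs(row_profile, row_gap):
        width = len(ink[top])
        # Ink pixels per column, restricted to the current band.
        col_profile = [sum(ink[r][c] for r in range(top, bottom)) for c in range(width)]
        for left, right in runs(col_profile, col_gap):
            boxes.append((top, left, bottom, right))
    return boxes


# Toy 8x10 "scan": two blocks side by side near the top, one block lower down.
W, B = 255, 0
page = [
    [W, W, W, W, W, W, W, W, W, W],
    [W, B, B, B, W, W, W, B, B, W],
    [W, B, W, B, W, W, W, B, B, W],
    [W, W, W, W, W, W, W, W, W, W],
    [W, W, W, W, W, W, W, W, W, W],
    [W, W, W, W, W, W, W, W, W, W],
    [W, B, B, B, B, B, W, W, W, W],
    [W, W, W, W, W, W, W, W, W, W],
]
blocks = find_text_blocks(binarize(page))
print(blocks)  # [(1, 1, 3, 4), (1, 7, 3, 9), (6, 1, 7, 6)]
```

A real scan would need noise removal, deskewing, and an adaptive threshold before anything like this works, and projection profiles break down on skewed or multi-column layouts with uneven gaps.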
Comments (2)
I believe Google uses the Tesseract OCR engine in conjunction with another tool called OCRopus, both of which are open source. I don't know anything about how they work, but you may be interested in checking out the code, available at the above links.
This is second-hand information from the digitization specialist in my library, but it seems that Google's approach is to just throw everything through the automated process, OCR anything that looks like text, and not fuss too much about cropping individual images or doing much semantic analysis to look for image captions, etc. They may be doing subtle things that aren't obvious, but on the surface they are definitely gunning for quantity over quality, which is smart for them to do for their purposes, IMO.