Python OCR module for Linux?

Posted on 2024-11-03 16:48:21

I want to find an easy-to-use Python OCR module for Linux. I found pytesser (http://code.google.com/p/pytesser/), but it ships with a .exe executable.

I tried changing the code to run it under Wine, and it does work, but it is too slow and really not a good idea.

Are there any Linux alternatives that are as easy to use?

Comments (5)

遗心遗梦遗幸福 2024-11-10 16:48:21

You can just wrap tesseract in a function:

import os
import subprocess
import tempfile

def ocr(path):
    # Reserve a unique name; tesseract appends '.txt' to this output base.
    temp = tempfile.NamedTemporaryFile(delete=False)
    temp.close()

    # Run tesseract and wait for it to finish, capturing its console output.
    process = subprocess.Popen(['tesseract', path, temp.name],
                               stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    process.communicate()

    with open(temp.name + '.txt', 'r') as handle:
        contents = handle.read()

    # Clean up both the OCR output and the placeholder file.
    os.remove(temp.name + '.txt')
    os.remove(temp.name)

    return contents

If you want document segmentation and more advanced features, try out OCRopus.
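A variant of the same idea that avoids temporary files is sketched below. It assumes a Tesseract version that accepts `stdout` as the output base and the `-l` language option. The `runner` parameter and the `build_command` helper are hypothetical names introduced here so the function can be exercised without Tesseract installed; they are not part of any library.

```python
import subprocess


def build_command(image_path, lang="eng"):
    """Build the tesseract argument list.

    Passing 'stdout' as the output base asks tesseract to print the
    recognized text instead of writing a .txt file (assumed to be
    supported by the installed version).
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr(image_path, lang="eng", runner=subprocess.run):
    """Return the text recognized in image_path.

    `runner` is a hypothetical injection point that defaults to
    subprocess.run; swap it for a stub when testing.
    """
    result = runner(
        build_command(image_path, lang),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8")
```

With Tesseract on the PATH, `print(ocr("eurotext.jpg"))` would print the recognized text. Because `check=True` is set, a failed run raises an exception rather than silently returning an empty string.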

神仙妹妹 2024-11-10 16:48:21

In addition to Blender's answer, which simply runs the Tesseract executable, I would like to add that there are other OCR alternatives that can also be called as an external process.

ABBYY command-line OCR utility: http://ocr4linux.com/en:start

It is not free, so it is worth considering only if Tesseract's accuracy is not good enough for your task, or if you need more sophisticated layout analysis or export to PDF, Word, and other formats.

Update: here is a comparison of ABBYY and Tesseract accuracy: http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison

Disclaimer: I work for ABBYY.

苏璃陌 2024-11-10 16:48:21

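python-tesseract:

http://code.google.com/p/python-tesseract

# Uses the legacy cv2.cv interface, which is only available in OpenCV 2.x.
import cv2.cv as cv
import tesseract

# Initialize the engine with English language data from the current directory.
api = tesseract.TessBaseAPI()
api.Init(".", "eng", tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)

# Load the image as grayscale and hand it to tesseract.
image = cv.LoadImage("eurotext.jpg", cv.CV_LOAD_IMAGE_GRAYSCALE)
tesseract.SetCvImage(image, api)
text = api.GetUTF8Text()   # recognized text
conf = api.MeanTextConf()  # mean confidence of the recognized text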

゛时过境迁 2024-11-10 16:48:21

You should try the excellent scikits.learn library for machine learning. You can find two ready-to-run code examples here and here.

情域 2024-11-10 16:48:21

You have a bunch of options here.

One way, as others pointed out, is to use Tesseract. There are a number of wrappers for it by now, so the best approach is to do a quick PyPI search. The most used ones these days are:

Another useful site for finding similar engines is alternative.to. According to them, a few Linux-based systems are:

  • ABBYY
  • Tesseract
  • CuneiForm
  • Ocropus
  • GOCR
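To see which of these engines are already installed on a given machine, a small check along the following lines may help. The executable names in `CANDIDATES` are assumptions (distributions may package them under different names), and `installed_engines` with its `which` parameter is a hypothetical helper introduced only for this sketch.

```python
import shutil

# Assumed executable names; adjust them to match how your distribution
# packages each engine.
CANDIDATES = {
    "Tesseract": "tesseract",
    "CuneiForm": "cuneiform",
    "GOCR": "gocr",
}


def installed_engines(which=shutil.which):
    """Return {engine name: resolved path} for each candidate found on PATH.

    `which` defaults to shutil.which and can be replaced with a stub
    for testing.
    """
    found = {}
    for name, executable in CANDIDATES.items():
        path = which(executable)
        if path is not None:
            found[name] = path
    return found
```

Calling `installed_engines()` returns an empty dictionary when none of the assumed executables are on the PATH, which is a quick way to decide which engine to wrap as an external process.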