Is tesseract 3.00 multi-threaded?
I read some other posts suggesting that multi-threading support would be added in 3.00, but I'm not sure whether it actually made it into the 3.00 release.
Apart from multi-threading, is running multiple tesseract processes a feasible way to achieve concurrency?
Thanks.
Comments (3)
One thing I've done is use GNU Parallel to run as many Tess* instances as a multi-core system can handle, on multi-page documents that have been converted to single-page images.
GNU Parallel is a short program that compiles easily on most Linux distros (I'm using openSUSE 11.4).
Here's the command line that I use:
The -j 4 tells parallel to run up to four jobs at once, one for each of the four CPU cores on my server.
If you run this and do a 'top' in another terminal, you'll see up to four processes at a time until it has worked through all of the JPGs in the specified directory.
With the job count matched to the core count, your load should not exceed the number of CPU cores in your system (if you run Linux).
Here's the link to GNU Parallel:
http://www.gnu.org/software/parallel/
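For anyone without GNU Parallel installed, the same "one process per image, at most N at a time" idea can be sketched in Python. This is not the command line from this answer, only an illustration: the function names build_command and ocr_all are made up here, and it assumes the usual tesseract invocation of an input image followed by an output base name.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(image, binary="tesseract"):
    """Return the argv for one OCR job: <binary> <image> <output base>."""
    image = Path(image)
    # The output base is the image path without its extension;
    # tesseract adds its own extension to the file it writes.
    return [binary, str(image), str(image.with_suffix(""))]


def ocr_all(images, jobs=4, binary="tesseract"):
    """Run one process per image, at most `jobs` at a time; return exit codes."""
    def run(image):
        return subprocess.run(build_command(image, binary)).returncode

    # The threads only wait on child processes, so the OCR work itself
    # happens in separate tesseract processes, one per image.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run, images))
```

Calling ocr_all(sorted(Path("scans").glob("*.jpg")), jobs=4) would then play the role of -j 4 over a directory of JPGs, where "scans" is a placeholder directory name.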
No. Browsing the code at http://code.google.com/p/tesseract-ocr/source/browse/ shows that none of the current code in trunk seems to use multi-threading, at least in the base classes, the API, and the neural network classes.
I used parallel as well, on CentOS, this way:
I used the --gnu option, as suggested by the stdout log, which was:
The {} and {.} are placeholders for parallel: in this case you're telling tesseract to use the listed file as its first argument and the same file name without its extension as its second argument. Everything is well explained in the parallel man pages.
Now, if you have, say, three .tif files, you can run tesseract three times, once per file, and sum the execution times; then run the command above with time in front of parallel. Comparing the two gives you the speedup.
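As a rough illustration of the two points above, here is a small Python sketch that mimics how the two placeholders expand for one file and then compares serial against concurrent wall-clock time. The helper names expand, timed and speedup are hypothetical, and expand only imitates the two placeholders discussed here, not parallel's full replacement-string syntax.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def expand(template, filename):
    """Imitate the two placeholders: {} is the input file, {.} drops its extension."""
    stem = str(Path(filename).with_suffix(""))
    return [arg.replace("{.}", stem).replace("{}", filename) for arg in template]


def timed(run_all):
    """Return the wall-clock seconds taken by calling run_all()."""
    start = time.perf_counter()
    run_all()
    return time.perf_counter() - start


def speedup(commands, jobs):
    """Time the commands one after another, then `jobs` at a time; return the ratio."""
    serial = timed(lambda: [subprocess.run(c) for c in commands])

    def concurrent():
        with ThreadPoolExecutor(max_workers=jobs) as pool:
            list(pool.map(subprocess.run, commands))

    return serial / timed(concurrent)
```

For three .tif files, something like speedup([expand(["tesseract", "{}", "{.}"], f) for f in files], jobs=3) would give a number comparable to the manual time measurement, with a value above 1 meaning the concurrent run finished sooner.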