OpenMP overhead
I have parallelized image convolution and LU factorization using OpenMP and Intel TBB, and I am testing them on 1 to 8 cores. When I run on a single core, specifying one thread with omp_set_num_threads(1) in OpenMP and task_scheduler_init InitTBB(1) in TBB, the TBB version is slightly slower than the sequential code because of TBB overhead. Surprisingly, the OpenMP version shows no overhead on a single core and performs exactly the same as the sequential code (compiled with the Intel compiler at the O3 optimization level). I am using static scheduling for the OpenMP loops. Is this realistic, or am I making a mistake?
Comments (4)
The OpenMP runtime will probably not create any additional threads if you run it with just one thread.
Also, just adding OpenMP parallelization directives sometimes makes serial code run faster too, because you are essentially giving the compiler more information. A work-sharing construct, for example, tells the compiler that the iterations of the loop are independent of each other. The compiler might not have been able to deduce that on its own, and knowing it allows more aggressive optimization strategies. Not always, of course, but I have seen it happen with "real world code".
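As a rough illustration of a work-sharing construct on a loop with independent iterations, here is a minimal sketch. The function name convolve1d and the 1D "valid" convolution are assumptions made for the example, not the asker's actual image code. Whether the directive also helps the serial case depends on the compiler, so treat it as something to measure rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: 1D "valid" convolution (no padding, kernel not flipped).
// Each out[i] is computed only from the inputs, so iterations are independent.
std::vector<double> convolve1d(const std::vector<double>& signal,
                               const std::vector<double>& kernel) {
    assert(!kernel.empty() && signal.size() >= kernel.size());
    const long n = static_cast<long>(signal.size() - kernel.size() + 1);
    const long k = static_cast<long>(kernel.size());
    std::vector<double> out(static_cast<std::size_t>(n), 0.0);

    const double* s = signal.data();
    const double* w = kernel.data();
    double* o = out.data();

    // If OpenMP is not enabled at compile time, the pragma is ignored
    // and the loop simply runs serially.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        double acc = 0.0;
        for (long j = 0; j < k; ++j) {
            acc += s[i + j] * w[j];
        }
        o[i] = acc;
    }
    return out;
}
```

Calling convolve1d with the same inputs should give the same result whether OpenMP is enabled or not, which makes it easy to time the two builds against each other.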
With OpenMP, the compiler does all the work. If the compiler knows the code will always run serially, it can quite legitimately skip all of the parallel bits.
TBB, as I understand it, is basically just a library. Your algorithm always has to be decorated with the parts needed to run it in parallel as well as serially.
OpenMP forks a decorated part of the code (#pragma omp for/parallel) into a main thread (which would also execute without OpenMP) and additional threads.
If you configure it to use only 1 thread, there is only the main thread, and it executes as it would without the OpenMP directive. There is no overhead, because the execution path wasn't forked.
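One way to check how many threads actually execute a parallel region is to ask the runtime from inside it. The sketch below is only illustrative: team_size_with is a hypothetical helper name, and the _OPENMP guard is there so the code also builds when OpenMP is disabled. It reports the team size only and does not measure any cost of entering the region.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: returns how many threads execute a parallel region
// after requesting `requested` threads. Returns 1 when built without OpenMP.
int team_size_with(int requested) {
    assert(requested >= 1);
    int team = 1;
#ifdef _OPENMP
    omp_set_num_threads(requested);
    #pragma omp parallel
    {
        // Only one thread of the team writes the shared result.
        #pragma omp single
        team = omp_get_num_threads();
    }
#else
    (void)requested;  // unused in a build without OpenMP
#endif
    return team;
}
```

With a request of 1, the region runs on the calling thread alone. For larger requests the runtime may hand out fewer threads than asked for.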
The thing about OpenMP is that the compiler does the work for you. It requires minimal modification to the sequential code and often gives reasonably good results if the tasks given to each thread are fairly large. I would suggest testing your code with Pthreads or C++11 std::thread and comparing the results.
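For comparison, a hand-rolled version with C++11 std::thread might look roughly like the sketch below. The function scale_with_threads and the contiguous chunking are assumptions chosen to resemble static scheduling. They are not the asker's convolution or LU code. Depending on the toolchain, building it may require a threading flag such as -pthread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: multiplies every element by `factor`, splitting the
// index range into contiguous chunks, one per thread.
void scale_with_threads(std::vector<double>& data, double factor,
                        unsigned num_threads) {
    assert(num_threads >= 1);
    const std::size_t n = data.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceiling

    auto work = [&data, factor](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) data[i] *= factor;
    };

    std::vector<std::thread> workers;
    // Chunks 1..num_threads-1 go to new threads; chunk 0 stays on the caller,
    // so a request for one thread creates no extra threads at all.
    for (unsigned t = 1; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin < end) workers.emplace_back(work, begin, end);
    }
    work(0, std::min(n, chunk));
    for (std::thread& w : workers) w.join();
}
```

Timing this against the OpenMP and sequential versions with num_threads set to 1 gives a third data point for the single-thread overhead question.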