OpenCL, TBB, OpenMP
I have implemented a few ordinary loop-based applications in OpenMP, TBB, and OpenCL. In all of them, OpenCL gives far better performance than the others, even though I am running it only on the CPU and have done no specific optimization in the kernels. OpenMP and TBB perform well too, but far below OpenCL. What could be the reason? Both are CPU-specialized frameworks, so I would expect them to perform at least as well as OpenCL.
My second question is about OpenMP versus TBB. In my implementations, OpenMP always performs better than TBB, although I have not tuned the code much since I am not an expert. Is there a reason OpenMP would normally outperform TBB? I assumed that both of them, and even OpenCL, use the same kind of thread pooling at a low level. Any expert opinions? Thanks.
Comments (3)
One advantage that OpenCL has over TBB and OpenMP is that it can take better advantage of the SIMD parallelism in your hardware. Some OpenCL implementations run your code so that each work item occupies one SIMD vector lane, in addition to spreading work across separate cores. Depending on the algorithm, this can provide a substantial performance benefit.
C compilers can also exploit some SIMD parallelism through auto-vectorization, but the memory aliasing rules in C make this hard in some cases: the compiler often cannot prove that two pointers refer to non-overlapping memory. Since OpenCL requires programmers to call out the work items and fence memory accesses explicitly, an OpenCL compiler can be more aggressive.
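To make the aliasing point concrete, here is a minimal sketch in C. The function names `saxpy_may_alias` and `saxpy_restrict` are made up for illustration and are not taken from the question. In the first version the compiler has to allow for `out` overlapping `x` or `y`; in the second, the C99 `restrict` qualifier is the caller's promise that they do not overlap, which is roughly the guarantee an OpenCL compiler gets from the programming model.

```c
#include <assert.h>
#include <stddef.h>

/* Plain pointers: a store to out[i] might change a later x[j] or y[j],
 * so the compiler must either prove there is no overlap, emit a runtime
 * overlap check, or fall back to scalar code. */
void saxpy_may_alias(size_t n, float a,
                     const float *x, const float *y, float *out)
{
    assert(n == 0 || (x && y && out));
    for (size_t i = 0; i < n; ++i) {
        out[i] = a * x[i] + y[i];
    }
}

/* restrict (C99): the caller promises that x, y and out do not overlap.
 * Breaking that promise is undefined behavior; the compiler does not
 * verify it. */
void saxpy_restrict(size_t n, float a,
                    const float *restrict x, const float *restrict y,
                    float *restrict out)
{
    assert(n == 0 || (x && y && out));
    for (size_t i = 0; i < n; ++i) {
        out[i] = a * x[i] + y[i];
    }
}
```

Both functions compute the same result for non-overlapping arrays. Whether either loop is actually vectorized depends on the compiler and flags; with GCC, `-O3 -fopt-info-vec` reports the decision, and with Clang, `-O3 -Rpass=loop-vectorize` does.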
In the end, it depends on your code. For each of OpenCL, OpenMP, and TBB, one could find an algorithm for which it is the best choice.
The OpenCL runtime for CPU and MIC provided by Intel uses TBB under the hood. It is far from just "thread pooling at a low level", since it takes advantage of the sophisticated scheduling and partitioning algorithms TBB provides for better load balance, and therefore better utilization of the CPUs.
As for TBB vs. OpenMP, it usually comes down to incorrect measurements. For example, TBB has no implicit barrier like the one in OpenMP, so a warm-up loop is not enough. You have to make sure all the threads have been created and that this overhead is not included in your measurements. Another example: sometimes compilers are not able to vectorize with TBB the same code that they vectorize with OpenMP.
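A rough sketch of the OpenMP side of such a measurement, in C, might look like the following. The helpers `now_seconds`, `scale`, and `time_scale` are hypothetical names, and the workload is a placeholder. The untimed first pass absorbs thread creation, and the implicit barrier at the end of the `parallel for` means every thread has finished before the clock is read. With TBB that barrier does not exist, so the equivalent warm-up would need some explicit way of making sure all worker threads are running before timing starts.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock time in seconds. Uses omp_get_wtime when built with OpenMP,
 * otherwise falls back to C11 timespec_get. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
#endif
}

/* Placeholder workload: scale every element in place. The parallel for
 * ends with an implicit barrier, so all threads are done on return. */
static void scale(float *data, size_t n, float factor)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i) {
        data[i] *= factor;
    }
}

/* Runs one untimed warm-up pass, then returns the fastest of `reps`
 * timed passes in seconds. A factor of 1.0f leaves the data unchanged. */
double time_scale(float *data, size_t n, int reps)
{
    assert(reps > 0);
    scale(data, n, 1.0f);            /* warm-up: thread start-up cost lands here */

    double best = -1.0;
    for (int r = 0; r < reps; ++r) {
        double t0 = now_seconds();
        scale(data, n, 1.0f);
        double dt = now_seconds() - t0;
        if (best < 0.0 || dt < best) {
            best = dt;
        }
    }
    return best;
}
```

Built without OpenMP support the pragma is ignored and the loop runs serially, which gives a convenient single-threaded baseline from the same source. Taking the minimum over several repetitions is one common choice for reducing noise; a median would be equally reasonable.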
OpenCL kernels are compiled for the given hardware. The potential for vendor/hardware specific optimisations is huge.