Parallel reduction of an array on CPU
Is there a way to do a parallel reduction of an array on the CPU in C/C++? I recently learnt that it's not possible using OpenMP. Are there any alternatives?
Added: Note that you can implement "custom" reduction with OpenMP, in the way described here.
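As a rough illustration of what such a "custom" reduction can look like (this is one common pattern, not necessarily the one the link refers to): each thread accumulates into a private copy of the array, and the private copies are merged inside a critical section. The function name sum_rows and the row-major layout are assumptions made for this sketch. Compilers that support OpenMP 4.5 or later can also reduce over an array section directly in C/C++, for example with reduction(+:out[:M]) on a pointer or array named out.

```cpp
#include <cstddef>
#include <vector>

// Sums N rows of length M (stored row-major in `data`) into one vector of
// length M. sum_rows is a hypothetical name used only for this sketch.
std::vector<double> sum_rows(const std::vector<double>& data, int N, int M) {
    std::vector<double> total(M, 0.0);

    #pragma omp parallel
    {
        // Each thread owns a private accumulator, so the loop needs no locking.
        std::vector<double> local(M, 0.0);

        #pragma omp for nowait
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < M; ++i)
                local[i] += data[static_cast<std::size_t>(j) * M + i];

        // Merge the private copies into the shared result, one thread at a time.
        #pragma omp critical
        {
            for (int i = 0; i < M; ++i)
                total[i] += local[i];
        }
    }
    return total;
}
```

If the compiler is not run with OpenMP enabled (e.g. without -fopenmp on GCC or Clang), the pragmas are ignored and the function simply runs serially with the same result.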
For C++: with parallel_reduce in Intel's TBB (SO tag: tbb), you can perform reductions on complex types such as arrays and structs, though the amount of code required can be significantly larger than with OpenMP's reduction clause.
As an example, let's parallelize a naive implementation of matrix-vector multiplication, y = Cx, where C is an M-by-N matrix. The serial code consists of two loops. Usually, to parallelize it, the loops are exchanged so that the outer loop iterations become independent and can be processed in parallel.
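The two loop orders described above might look like the following. This is a minimal sketch and not the author's original listing: the function names matvec_serial and matvec_rows, the flat row-major storage of C (element (i, j) at C[i*N + j]), and the choice of an OpenMP pragma for the parallel loop are all assumptions.

```cpp
#include <cstddef>
#include <vector>

// y = C x for an M-by-N matrix C stored row-major: C(i, j) is C[i*N + j].

// Serial version: the outer loop runs over columns j, so every outer
// iteration updates all of y and the iterations are not independent.
std::vector<double> matvec_serial(const std::vector<double>& C,
                                  const std::vector<double>& x, int M, int N) {
    std::vector<double> y(M, 0.0);
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < M; ++i)
            y[i] += C[static_cast<std::size_t>(i) * N + j] * x[j];
    return y;
}

// Exchanged loops: each outer iteration i writes only y[i], so the outer
// loop can be split across threads without any reduction.
std::vector<double> matvec_rows(const std::vector<double>& C,
                                const std::vector<double>& x, int M, int N) {
    std::vector<double> y(M, 0.0);
    #pragma omp parallel for
    for (int i = 0; i < M; ++i) {
        double sum = 0.0;
        for (int j = 0; j < N; ++j)
            sum += C[static_cast<std::size_t>(i) * N + j] * x[j];
        y[i] = sum;
    }
    return y;
}
```

Note that matvec_rows exposes only M independent tasks, which is the limitation discussed next.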
However, this is not always a good idea. If M is small and N is large, swapping the loops won't give enough parallelism (for example, think of calculating a weighted centroid of N points in M-dimensional space, with C being the array of points and x being the array of weights). So a reduction over an array (i.e. a point) would be helpful. This can be done with TBB's parallel_reduce (sorry, the code was not tested; errors are possible).
Disclaimer: I am affiliated with TBB.
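A minimal sketch of such an array reduction, written against the Body form of tbb::parallel_reduce (splitting constructor, operator() over a blocked_range, and join). It is not the author's original code. The names MatVecBody and matvec_reduce, the row-major layout of C, and the USE_TBB macro are assumptions of this sketch; USE_TBB is a hypothetical opt-in switch so that the same file also builds without TBB, in which case a serial stand-in exercises the same accumulate and join steps.

```cpp
#include <cstddef>
#include <vector>
#ifdef USE_TBB  // hypothetical opt-in macro: build with -DUSE_TBB and link TBB
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#endif

// Reduction body: accumulates y += C(:, j) * x[j] over a range of columns j.
// C is M-by-N, row-major: C(i, j) is C[i*N + j] (an assumption of this sketch).
struct MatVecBody {
    const double* C;
    const double* x;
    std::size_t M, N;
    std::vector<double> y;  // partial result owned by this body

    MatVecBody(const double* C_, const double* x_, std::size_t M_, std::size_t N_)
        : C(C_), x(x_), M(M_), N(N_), y(M_, 0.0) {}

    // Add the contribution of columns [begin, end) to the partial result.
    void accumulate(std::size_t begin, std::size_t end) {
        for (std::size_t j = begin; j < end; ++j)
            for (std::size_t i = 0; i < M; ++i)
                y[i] += C[i * N + j] * x[j];
    }

    // Merge another body's partial result into this one.
    void join(MatVecBody& rhs) {
        for (std::size_t i = 0; i < M; ++i) y[i] += rhs.y[i];
    }

#ifdef USE_TBB
    // Splitting constructor: a new body starts from a zeroed partial result.
    MatVecBody(MatVecBody& other, tbb::split)
        : C(other.C), x(other.x), M(other.M), N(other.N), y(other.M, 0.0) {}

    // Called by TBB for each subrange; must accumulate, not overwrite.
    void operator()(const tbb::blocked_range<std::size_t>& r) {
        accumulate(r.begin(), r.end());
    }
#endif
};

std::vector<double> matvec_reduce(const std::vector<double>& C,
                                  const std::vector<double>& x,
                                  std::size_t M, std::size_t N) {
    MatVecBody body(C.data(), x.data(), M, N);
#ifdef USE_TBB
    tbb::parallel_reduce(tbb::blocked_range<std::size_t>(0, N), body);
#else
    // Stand-in without TBB: process two halves with separate bodies and join
    // them, mimicking the split and join steps of a parallel reduction.
    MatVecBody right(C.data(), x.data(), M, N);
    body.accumulate(0, N / 2);
    right.accumulate(N / 2, N);
    body.join(right);
#endif
    return body.y;
}
```

Here the parallelism comes from splitting the N columns, so it scales with N even when M is small. The price is one length-M partial vector per body plus the join work.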