Multi-dimensional nested OpenMP loops
What is the proper way to parallelize a multi-dimensional embarrassingly parallel loop in OpenMP? The number of dimensions is known at compile-time, but which dimensions will be large is not. Any of them may be one, two, or a million. Surely I don't want N omp parallel directives for an N-dimensional loop...
Thoughts:
The problem is conceptually simple. Only the outermost 'large' loop needs to be parallelized, but the loop dimensions are unknown at compile-time and may change.
Will dynamically setting omp_set_num_threads(1) and #pragma omp for schedule(static, huge_number) make certain loop parallelizations a no-op? Will this have undesired side effects or overhead? It feels like a kludge. The OpenMP specification (sections 2.10, A.38, A.39) distinguishes conforming from non-conforming nested parallelism, but doesn't suggest the best approach to this problem.
Re-ordering the loops is possible but may result in a lot of cache misses. Unrolling is possible but non-trivial. Is there another way?
Here's what I'd like to parallelize:
for(i0=0; i0<n[0]; i0++) {
    for(i1=0; i1<n[1]; i1++) {
        ...
        for(iN=0; iN<n[N]; iN++) {
            <embarrassingly parallel operations>
        }
        ...
    }
}
Thanks!
Comments (1)
The collapse clause is probably what you're looking for, as described here. It essentially forms a single loop out of the nest, which is then parallelized, and it is designed for exactly these sorts of situations. So you'd do the following and be all set.