Default loop iteration schedule in OpenMP
I am using OpenMP with the following code:
#include <stdio.h>
#include <omp.h>
#define NUMS 17   /* number of loop iterations; set to 17 or 19 for the tests below */
int main(void) {
    omp_set_num_threads(6);
    #pragma omp parallel for
    for (int i = 0; i < NUMS; ++i)
        printf("id is %3d thread is %d\n", i, omp_get_thread_num());
    return 0;
}
I found the following claim about it (in a blog post): the iterations are distributed evenly among the threads, and when the number of iterations is not divisible by the number of threads, the per-thread count is rounded up.
Well, I first set NUMS = 17, and the results were as follows:
id is 12 thread is 4
id is 13 thread is 4
id is 14 thread is 4
id is 9 thread is 3
id is 10 thread is 3
id is 11 thread is 3
id is 0 thread is 0
id is 1 thread is 0
id is 2 thread is 0
id is 6 thread is 2
id is 7 thread is 2
id is 8 thread is 2
id is 15 thread is 5
id is 16 thread is 5
id is 3 thread is 1
id is 4 thread is 1
id is 5 thread is 1
As you can see, \lceil 17/6 \rceil = 3 (rounded up; forgive me, I don't know how to insert a LaTeX formula here), so the result is as expected.
However, if I set NUMS = 19, then by the same rounding \lceil 19/6 \rceil = 4, so each thread should be assigned 4 iterations:
id is 0 thread is 0
id is 1 thread is 0
id is 2 thread is 0
id is 3 thread is 0
id is 10 thread is 3
id is 11 thread is 3
id is 12 thread is 3
id is 7 thread is 2
id is 8 thread is 2
id is 9 thread is 2
id is 13 thread is 4
id is 14 thread is 4
id is 15 thread is 4
id is 4 thread is 1
id is 5 thread is 1
id is 6 thread is 1
id is 16 thread is 5
id is 17 thread is 5
id is 18 thread is 5
As you can see, only the first thread was assigned 4 iterations.
So now I cannot figure out what is going on here. What exactly is OpenMP's default loop schedule?
Summarising all the comments to create an answer (so thanks to all who commented).
First, what the standard says/requires:

schedule(static) with no chunk_size is only specified as allocating "approximately equal" numbers of iterations to each available thread.

Second, what happens in reality at present:

Compilers default to using schedule(static) when no schedule is specified. (Though schedule(nonmonotonic:dynamic) might, now, be a better choice.)

At least the LLVM OpenMP runtime allocates iterations to threads as this Python code shows (the critical part is myCount, the rest is just to test it and show your test cases). If you run that you can see the result for your two test cases:
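(The answer's original Python listing was not captured on this page. Below is a minimal sketch, assuming the chunkless static split suggested by the output in the question: every thread gets floor(iterations / threads) iterations and the first (iterations mod threads) threads get one extra. The function name myCount follows the answer's reference; the body is a reconstruction, not the original code.)

def myCount(iterations, threads, tid):
    # Chunkless static split: every thread gets iterations // threads,
    # and the first (iterations % threads) threads get one extra iteration.
    base = iterations // threads
    extra = iterations % threads
    return base + (1 if tid < extra else 0)

def show(iterations, threads):
    counts = [myCount(iterations, threads, tid) for tid in range(threads)]
    print(iterations, "iterations over", threads, "threads ->", counts)

show(17, 6)   # 17 iterations over 6 threads -> [3, 3, 3, 3, 3, 2]
show(19, 6)   # 19 iterations over 6 threads -> [4, 3, 3, 3, 3, 3]

Both distributions match the output in the question: with 17 iterations only the last thread gets 2, and with 19 iterations only thread 0 gets 4.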
If you want to get into the full horror of loop scheduling, there is a whole chapter on this in "High Performance Parallel Runtimes -- Design and Implementation"
p.s. It's worth noting that the schedule shown above cannot be explicitly requested by setting a block_size on a static schedule, since the standard does not then allow the remainder iterations to be split up as they are here. (E.g. if we try to allocate 10 iterations to 4 threads, setting block_size(2) gives us (4,2,2,2), whereas setting it to 3 gives (3,3,3,1); the scheme above gives (3,3,2,2), which has an imbalance of 1, whereas the explicit schemes each have an imbalance of 2.)
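To make that comparison concrete, here is a small companion sketch (assuming the behaviour the standard specifies for schedule(static, chunk): fixed-size chunks dealt to the threads in round-robin order of thread number) that reproduces the three distributions mentioned above:

def static_chunked_counts(iterations, threads, chunk):
    # schedule(static, chunk): chunks of `chunk` consecutive iterations
    # are handed out to the threads in round-robin order.
    counts = [0] * threads
    tid = 0
    for start in range(0, iterations, chunk):
        counts[tid] += min(chunk, iterations - start)
        tid = (tid + 1) % threads
    return counts

print(static_chunked_counts(10, 4, 2))  # [4, 2, 2, 2]
print(static_chunked_counts(10, 4, 3))  # [3, 3, 3, 1]
# The chunkless split (myCount above) would give [3, 3, 2, 2] instead.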