Default loop iteration schedule in OpenMP



I used the following OpenMP code:

omp_set_num_threads(6);
#pragma omp parallel for
for(int i = 0; i < NUMS; ++i){
    printf("id is %3d   thread is %d\n",i, omp_get_thread_num());
}

I found out (in a blog post) that each thread is allocated an approximately equal share of the iterations, and that when the number of iterations is not evenly divisible by the number of threads, the per-thread count is rounded up.

Well, I first set NUMS = 17, and the result is as follows:

id is  12   thread is 4
id is  13   thread is 4
id is  14   thread is 4
id is   9   thread is 3
id is  10   thread is 3
id is  11   thread is 3
id is   0   thread is 0
id is   1   thread is 0
id is   2   thread is 0
id is   6   thread is 2
id is   7   thread is 2
id is   8   thread is 2
id is  15   thread is 5
id is  16   thread is 5
id is   3   thread is 1
id is   4   thread is 1
id is   5   thread is 1

As can be seen, $\lceil 17/6 \rceil = 3$ (rounded up), so the result is as expected.

However, if I set NUMS = 19, then by the same rounding, $\lceil 19/6 \rceil = 4$, so each thread should be allocated 4 iterations. Instead I get:

id is   0   thread is 0
id is   1   thread is 0
id is   2   thread is 0
id is   3   thread is 0
id is  10   thread is 3
id is  11   thread is 3
id is  12   thread is 3
id is   7   thread is 2
id is   8   thread is 2
id is   9   thread is 2
id is  13   thread is 4
id is  14   thread is 4
id is  15   thread is 4
id is   4   thread is 1
id is   5   thread is 1
id is   6   thread is 1
id is  16   thread is 5
id is  17   thread is 5
id is  18   thread is 5

As you can see, only the first thread (thread 0) is assigned 4 iterations.

So now I can't figure it out: what exactly is the reason for this? What exactly is OpenMP's default loop schedule?


Comments (1)

笑叹一世浮沉 2025-02-13 03:47:54


Summarising all the comments to create an answer (so thanks to all who commented).

First, what the standard says/requires :-

  • It says nothing about which schedule should be used when it is unspecified.
  • schedule(static) with no chunk_size is only specified as allocating "approximately equal" numbers of iterations to each available thread.

Second, what happens in reality at present :-

  • Compilers default to using schedule(static) when no schedule is specified. (Though schedule(nonmonotonic:dynamic) might, now, be a better choice; see the sketch after the output below for how to request a schedule explicitly.)

  • At least the LLVM OpenMP runtime allocates iterations to threads as this Python code shows (the critical part is myCount, the rest is just to test it and show your test cases).

    #
    # Show number of iterations allocated to each thread
    # by schedule(static) in the LLVM runtime
    #

    def myCount(me, numThreads, total):
        base = total // numThreads
        remainder = total % numThreads
        return base + 1 if me < remainder else base

    def test(numThreads, total):
        print("Threads: ", numThreads, " Iterations: ", total)
        allocated = 0
        for thread in range(0, numThreads):
            mine = myCount(thread, numThreads, total)
            allocated += mine
            print(thread, ": ", mine)
        if allocated != total:
            print("***ERROR*** ", allocated, " allocated, ", total, " requested")

    test(6, 17)
    test(6, 19)


If you run that you can see the result for your two test cases:-

% python3 static.py 
Threads:  6  Iterations:  17
0 :  3
1 :  3
2 :  3
3 :  3
4 :  3
5 :  2
Threads:  6  Iterations:  19
0 :  4
1 :  3
2 :  3
3 :  3
4 :  3
5 :  3
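
If you want to pin down the behaviour rather than rely on the implementation default, you can write the schedule explicitly on the loop. Here is a minimal sketch along the lines of the code in the question (the NUMS value and printf format are simply taken from there; schedule(dynamic) or schedule(nonmonotonic:dynamic) could be substituted, as mentioned above):

    #include <stdio.h>
    #include <omp.h>

    #define NUMS 19

    int main(void) {
        omp_set_num_threads(6);
        /* Explicitly request the static schedule with no chunk_size,
           i.e. the "approximately equal" split discussed above. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < NUMS; ++i) {
            printf("id is %3d   thread is %d\n", i, omp_get_thread_num());
        }
        return 0;
    }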

If you want to get into the full horror of loop scheduling, there is a whole chapter on this in "High Performance Parallel Runtimes -- Design and Implementation".

p.s. It's worth noting that the schedule shown above cannot be explicitly requested by setting a chunk_size on a static schedule, since the standard does not then allow the remainder iterations to be split up as they are here. (E.g. if we allocate 10 iterations to 4 threads, schedule(static, 2) gives (4,2,2,2) and schedule(static, 3) gives (3,3,3,1), whereas the scheme above gives (3,3,2,2); the latter has an imbalance of 1, while the explicit schemes each have an imbalance of 2.)
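
To make that p.s. concrete, here is a small sketch (not tied to any particular runtime, and using a hypothetical helper chunkCount) of how schedule(static, chunk_size) deals chunks out round-robin; it reproduces the (4,2,2,2) and (3,3,3,1) splits for 10 iterations on 4 threads quoted above:

    #include <stdio.h>

    /* Number of iterations thread `me` receives under schedule(static, chunk):
       chunks of `chunk` consecutive iterations are dealt out round-robin, so
       thread `me` owns the chunks starting at me*chunk, (me+numThreads)*chunk, ... */
    static int chunkCount(int me, int numThreads, int total, int chunk) {
        int count = 0;
        for (int start = me * chunk; start < total; start += numThreads * chunk) {
            int remaining = total - start;
            count += remaining < chunk ? remaining : chunk;
        }
        return count;
    }

    int main(void) {
        const int threads = 4, total = 10;
        for (int chunk = 2; chunk <= 3; ++chunk) {
            printf("chunk_size %d:", chunk);
            for (int t = 0; t < threads; ++t)
                printf(" %d", chunkCount(t, threads, total, chunk));
            printf("\n");   /* 4 2 2 2 for chunk_size 2, 3 3 3 1 for chunk_size 3 */
        }
        return 0;
    }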
