Elegantly initializing OpenMP threads in a parallel for loop
I have a for loop that uses a (somewhat complicated) counter object sp_ct
to initialize an array. The serial code looks like
    sp_ct.depos(0);
    for (int p = 0; p < size; p++, sp_ct.increment()) {
        in[p] = sp_ct.parable_at_basis();
    }
My counter supports parallelization because it can be initialized to the state after p increments, leading to the following working code fragment:
    int firstloop = -1;
    #pragma omp parallel for \
        default(none) shared(size,in) firstprivate(sp_ct,firstloop)
    for (int p = 0; p < size; p++) {
        if (firstloop == -1) {
            sp_ct.depos(p); firstloop = 0;
        } else {
            sp_ct.increment();
        }
        in[p] = sp_ct.parable_at_basis();
    } // end omp parallel for
I dislike this because the clutter obscures what is really going on, and because it has an unnecessary branch inside the loop (yes, I know this is unlikely to have a measurable influence on running time because it is so predictable...).
I would prefer to write something like
    #pragma omp parallel for default(none) shared(size,in) firstprivate(sp_ct,firstloop)
    for (int p = 0; p < size; p++) {
        #pragma omp initialize // or something
        { sp_ct.depos(p); }
        in[p] = sp_ct.parable_at_basis();
        sp_ct.increment();
    } // end omp parallel for
Is this possible?
Comments (4)
If I generalize your problem, the question is "How to execute some initialization code for each thread of a parallel section?", is that right? You may use a property of the firstprivate clause: "the initialization or construction of the given variable happens as if it were done once per thread, prior to the thread's execution of the construct".
Then the loop may be written:
The downside is that you have to write this extra initializer, and some code is moved away from its actual execution point. The upside is that you can reuse it, and the loop syntax is cleaner.
From what I can tell, you can do this by manually defining the chunks. This resembles what I was trying to do with induction in OpenMP, see "Induction with OpenMP: getting range values for a parallized for loop in OpenMP".
So you probably want something like this:
Notice that, except for some declarations and the changed range values, this code is almost identical to the serial code.
Also, you don't have to declare anything shared or private. Everything declared inside the parallel block is private and everything declared outside is shared. You don't need firstprivate either. This makes the code cleaner and clearer (IMHO).
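The manual chunking described in this answer could look roughly like the sketch below. It is only an illustration under assumptions: the Counter class is a hypothetical stand-in for sp_ct (the real class is not shown in the question; here it just tracks p*p incrementally), fill_manual_chunks is an invented name, and the even split of [0, size) across threads is one possible choice of range values.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds when OpenMP is not enabled.
static int omp_get_thread_num() { return 0; }
static int omp_get_num_threads() { return 1; }
#endif

// Hypothetical stand-in for the question's sp_ct: its value at position p is p*p.
class Counter {
public:
    void depos(long p) { pos_ = p; value_ = p * p; }    // jump directly to position p
    void increment() { value_ += 2 * pos_ + 1; ++pos_; } // cheap step to position p+1
    long parable_at_basis() const { return value_; }
private:
    long pos_ = 0;
    long value_ = 0;
};

std::vector<long> fill_manual_chunks(int size) {
    assert(size >= 0);
    std::vector<long> in(size);  // declared outside the parallel block: shared
    #pragma omp parallel
    {
        // Everything declared in here is private to the executing thread.
        const int nthreads = omp_get_num_threads();
        const int tid = omp_get_thread_num();
        // One contiguous chunk [start, finish) per thread.
        const int start = static_cast<int>(static_cast<long long>(tid) * size / nthreads);
        const int finish = static_cast<int>(static_cast<long long>(tid + 1) * size / nthreads);
        Counter sp_ct;
        sp_ct.depos(start);  // same shape as the serial loop, different range
        for (int p = start; p < finish; ++p, sp_ct.increment()) {
            in[p] = sp_ct.parable_at_basis();
        }
    }
    return in;
}
```

Because each thread computes its own range, the result does not depend on which schedule the implementation would have picked for a worksharing loop.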
I see what you're trying to do, and I don't think it is possible. I'm just going to write some code that I believe would achieve the same thing, and is somewhat clean, and if you like it, sweet!
Riko, implement sp_ct.depos() so that it invokes .increment() only as often as necessary to bring the counter to the passed parameter. Then you can use this code:
This solution has one additional benefit: your implementation only works if each thread receives exactly one chunk out of 0 - size. That is the case when specifying schedule(static) and omitting the chunk size (OpenMP 4.0 Specification, chapter 2.7.1, page 57). But since you did not specify a schedule, the schedule used is implementation dependent (OpenMP 4.0 Specification, chapter 2.3.2). If the implementation chooses dynamic or guided, threads will receive multiple chunks with gaps between them. So one thread could receive chunk 0-20 and then chunk 70-90, which will make p and sp_ct out of sync on the second chunk. The solution above is compatible with all schedules.
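A minimal sketch of the idea in this answer, under stated assumptions: Counter is a hypothetical stand-in for sp_ct (its value at position p is simply p*p), fill_any_schedule is an invented name, and depos() here takes a single increment() step when the target is the next position and otherwise jumps directly, which is one way to read "only as often as necessary".

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the question's sp_ct: its value at position p is p*p.
class Counter {
public:
    // Bring the counter to `target`, reusing the cheap step whenever possible.
    void depos(long target) {
        assert(target >= 0);
        if (valid_ && target == pos_) return;                       // already there
        if (valid_ && target == pos_ + 1) { increment(); return; }  // usual case inside a chunk
        pos_ = target;                                              // first call or gap between chunks
        value_ = target * target;
        valid_ = true;
    }
    void increment() { value_ += 2 * pos_ + 1; ++pos_; }
    long parable_at_basis() const { return value_; }
private:
    long pos_ = 0;
    long value_ = 0;
    bool valid_ = false;  // false until depos() has positioned this copy
};

std::vector<long> fill_any_schedule(int size) {
    assert(size >= 0);
    std::vector<long> in(size);
    Counter sp_ct;  // each thread gets its own, not yet positioned copy via firstprivate
    // schedule(dynamic, 4) is chosen on purpose: threads receive several chunks with gaps.
    #pragma omp parallel for default(none) shared(size, in) firstprivate(sp_ct) schedule(dynamic, 4)
    for (int p = 0; p < size; ++p) {
        sp_ct.depos(p);  // one step inside a chunk, a jump at the start of each chunk
        in[p] = sp_ct.parable_at_basis();
    }
    return in;
}
```

The branch has not disappeared, it has moved into depos(), but the loop body stays as short as the serial version and remains correct when a thread jumps from one chunk to a later one.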