Elegantly initializing OpenMP threads in a parallel for loop

Posted 2024-10-10 13:02:20

I have a for loop that uses a (somewhat complicated) counter object sp_ct to initialize an array. The serial code looks like

sp_ct.depos(0);
for(int p=0;p<size; p++, sp_ct.increment() ) {
  in[p]=sp_ct.parable_at_basis();
}

My counter supports parallelization because it can be initialized to the state after p increments, leading to the following working code fragment:

  int firstloop=-1;
#pragma omp parallel for \
       default(none) shared(size,in) firstprivate(sp_ct,firstloop)
  for(int p=0;p<size;p++) {
    if( firstloop == -1 ) {
      sp_ct.depos(p); firstloop=0;
    } else { 
      sp_ct.increment();
    }
    in[p]=sp_ct.parable_at_basis();
  } // end omp parallel for

I dislike this because of the clutter that obscures what is really going on, and because it has an unnecessary branch inside the loop (yes, I know this is unlikely to have a measurable influence on running time because it is so predictable...).

I would prefer to write something like

#pragma omp parallel for default(none) shared(size,in) firstprivate(sp_ct)
  for(int p=0;p<size;p++) {
#pragma omp initialize // or something
    {  sp_ct.depos(p); }
    in[p]=sp_ct.parable_at_basis();
    sp_ct.increment();
  } // end omp parallel for

Is this possible?

Comments (4)

给我一枪 2024-10-17 13:02:20

If I generalize your problem, the question is "How do I execute some initialization code for each thread of a parallel section?", is that right? You may use a property of the firstprivate clause: "the initialization or construction of the given variable happens as if it were done once per thread, prior to the thread's execution of the construct".

struct thread_initializer
{
  explicit thread_initializer(
    int size /*initialization params*/) : size_(size) {}

  //Copy constructor that does the init
  thread_initializer(const thread_initializer& _it) : size_(_it.size_)
  {
    //Here goes the once-per-thread initialization
    for(int p=0;p<size_;p++)
      sp_ct.depos(p);
  }

  int size_;
  scp_type sp_ct;
};

Then the loop may be written:

thread_initializer init(size);
#pragma omp parallel for \
       default(none) shared(size,in) firstprivate(init)
for(int p=0;p<size;p++) {
  init.sp_ct.increment();
  in[p]=init.sp_ct.parable_at_basis();
}

The bad things are that you have to write this extra initializer and that some code is moved away from its actual point of execution. The good things are that you can reuse it and that the loop syntax stays cleaner.
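
One way to see where each firstprivate copy is actually constructed is a small probe class whose copy constructor reports the calling thread. This is a quick experiment sketch, not part of the original answer; the class name and the loop are made up for illustration, and what it prints is implementation behaviour worth checking on your compiler:

#include <omp.h>
#include <cstdio>

// Probe whose copy constructor prints the id of the thread constructing it.
struct probe {
    probe() = default;
    probe(const probe&) {
        std::printf("copy constructed on thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }
};

int main() {
    probe pr;
    #pragma omp parallel for firstprivate(pr)
    for (int i = 0; i < 1000; i++) {
        // loop body irrelevant; we only care about the construction of pr
    }
}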

撩起发的微风 2024-10-17 13:02:20

From what I can tell, you can do this by manually defining the chunks. This looks somewhat like something I was trying to do with induction in OpenMP (see Induction with OpenMP: getting range values for a parallelized for loop in OpenMP).

So you probably want something like this:

#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();
    const int start = ithread*size/nthreads;
    const int finish = (ithread+1)*size/nthreads;       
    Counter_class_name sp_ct;

    sp_ct.depos(start);   
    for(int p=start; p<finish; p++, sp_ct.increment()) {
        in[p]=sp_ct.parable_at_basis();
    }
}

Notice that, except for some declarations and changing the range values, this code is almost identical to the serial code.

Also, you don't have to declare anything shared or private: everything declared inside the parallel block is private and everything declared outside is shared. You don't need firstprivate either. This makes the code cleaner and clearer (IMHO).
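
For readers who want to try this out, here is a self-contained sketch of the same chunking idea built around a trivial stand-in counter. The toy_counter class, its quadratic value, and the final check are assumptions for illustration only; they are not part of the original question's counter:

#include <omp.h>
#include <vector>
#include <cstdio>

// Trivial stand-in for the question's counter: parable_at_basis() is just
// the square of the current position, so results are easy to verify.
struct toy_counter {
    void depos(int p) { pos = p; }   // jump directly to the state after p increments
    void increment()  { ++pos; }
    double parable_at_basis() const { return double(pos) * pos; }
    int pos = 0;
};

int main() {
    const int size = 1000;
    std::vector<double> in(size);

    #pragma omp parallel
    {
        // Each thread computes its own contiguous chunk [start, finish).
        const int nthreads = omp_get_num_threads();
        const int ithread  = omp_get_thread_num();
        const int start    = ithread * size / nthreads;
        const int finish   = (ithread + 1) * size / nthreads;
        toy_counter sp_ct;

        sp_ct.depos(start);
        for (int p = start; p < finish; p++, sp_ct.increment())
            in[p] = sp_ct.parable_at_basis();
    }

    // Compare against the serial reference.
    int errors = 0;
    for (int p = 0; p < size; p++)
        if (in[p] != double(p) * p) ++errors;
    std::printf("%d mismatches\n", errors);
}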

第几種人 2024-10-17 13:02:20

I see what you're trying to do, and I don't think it is possible. I'm just going to write some code that I believe would achieve the same thing, and is somewhat clean, and if you like it, sweet!

sp_ct.depos(0);
in[0]=sp_ct.parable_at_basis();
#pragma omp parallel for \
       default(none) shared(size,in) firstprivate(sp_ct)
  for(int p = 1; p < size; p++) {
    sp_ct.increment();
    in[p]=sp_ct.parable_at_basis();
  } // end omp parallel for

泪眸﹌ 2024-10-17 13:02:20

Riko, implement sp_ct.depos() so that it invokes .increment() only as often as necessary to bring the counter to the passed parameter. Then you can use this code:

sp_ct.depos(0);
#pragma omp parallel for \
       default(none) shared(size,in) firstprivate(sp_ct)
for(int p=0;p<size;p++) {
  sp_ct.depos(p);
  in[p]=sp_ct.parable_at_basis();
} // end omp parallel for

This solution has one additional benefit: your implementation only works if each thread receives a single contiguous chunk out of 0 - size, which is the case when schedule(static) is specified with the chunk size omitted (OpenMP 4.0 Specification, chapter 2.7.1, page 57). But since you did not specify a schedule, the schedule used will be implementation dependent (OpenMP 4.0 Specification, chapter 2.3.2). If the implementation chooses to use dynamic or guided, threads will receive multiple chunks with gaps between them, so one thread could receive chunk 0-20 and then chunk 70-90, which would make p and sp_ct fall out of sync on the second chunk. The solution above is compatible with all schedules.
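
A minimal sketch of what such a depos() could look like, assuming a hypothetical counter that remembers its current position and only steps forward by the difference. The class name, members, and the trivial update rule are illustrative, not from the original:

struct toy_counter {
    // Bring the counter to the state after p increments, calling the
    // incremental update only as many times as needed.
    void depos(int p) {
        if (p < pos) {          // moving backwards: restart from the initial state
            value = 0.0;
            pos = 0;
        }
        for (; pos < p; ++pos)
            step();
    }
    void increment() { step(); ++pos; }
    double parable_at_basis() const { return value; }

private:
    void step() { value += 2.0 * pos + 1.0; }   // stand-in for the real update
    double value = 0.0;                         // equals pos*pos with this update
    int pos = 0;
};

With this, each thread's firstprivate copy starts at position 0, so its first depos(p) fast-forwards to the beginning of whatever chunk it was handed, and every later depos() inside that chunk costs a single increment, regardless of the schedule.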
