OpenMP: nowait and reduction clauses on the same pragma

Posted on 2024-11-15 00:00:55

I am studying OpenMP, and came across the following example:

#pragma omp parallel shared(n,a,b,c,d,sum) private(i)
{
    #pragma omp for nowait
    for (i=0; i<n; i++)
        a[i] += b[i];

    #pragma omp for nowait
    for (i=0; i<n; i++)
        c[i] += d[i];
    #pragma omp barrier

    #pragma omp for nowait reduction(+:sum)
    for (i=0; i<n; i++)
        sum += a[i] + c[i];
} /*-- End of parallel region --*/

In the last for loop, there is both a nowait and a reduction clause. Is this correct? Doesn't the reduction clause need to be synchronized?

若言繁花未落 2024-11-22 00:00:55

The nowait clauses on the second and last loops are somewhat redundant. The OpenMP spec mentions nowait before the end of the region, so perhaps the last one can stay in.

But the nowait on the second loop and the explicit barrier after it cancel each other out.

Lastly, about the shared and private clauses. In your code, shared has no effect, and private simply shouldn’t be used at all: If you need a thread-private variable, just declare it inside the parallel region. In particular, you should declare loop variables inside the loop, not before.

To make shared useful, you need to tell OpenMP that it shouldn’t share anything by default. You should do this to avoid bugs due to accidentally shared variables. This is done by specifying default(none). This leaves us with:

#pragma omp parallel default(none) shared(n, a, b, c, d, sum)
{
    #pragma omp for nowait
    for (int i = 0; i < n; ++i)
        a[i] += b[i];

    #pragma omp for
    for (int i = 0; i < n; ++i)
        c[i] += d[i];

    #pragma omp for nowait reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] + c[i];
} // End of parallel region

故事↓在人 2024-11-22 00:00:55

In some regards this seems like a homework problem, which I hate to do for people. On the other hand, the answers above are not totally accurate, and I feel they should be corrected.

First, while neither the shared nor the private clause is needed in this example, I disagree with Konrad that they shouldn't be used. One of the most common problems with people parallelizing code is that they don't take the time to understand how the variables are being used. Failing to privatize and/or protect shared variables that need it accounts for the largest number of problems that I see. Going through the exercise of examining how variables are used and putting them into the appropriate shared, private, etc. clauses will greatly reduce the number of problems you have.
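
As a rough illustration of that exercise, here is a minimal sketch in C. The function name scale_and_shift and the scratch variable t are made up for this example and do not appear in the original code. Because t is declared outside the parallel construct, it would be shared by default, so it is listed in a private clause; default(none) makes the compiler reject any variable whose sharing was never decided.

```c
#include <assert.h>

/* Hypothetical helper (not from the original example): a[i] += 2 * b[i]. */
void scale_and_shift(int n, double *a, const double *b)
{
    assert(n >= 0);

    int i;
    double t; /* scratch value; shared by default if not privatized */

    /* Every variable used in the region is classified explicitly. */
    #pragma omp parallel for default(none) shared(n, a, b) private(i, t)
    for (i = 0; i < n; i++) {
        t = 2.0 * b[i]; /* each thread writes its own copy of t */
        a[i] += t;
    }
}
```

If t were left out of the private clause here, default(none) would turn the oversight into a compile-time error instead of a data race.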

As for the question about the barriers, the first loop can have a nowait clause because the value it computes (a) is not used in the second loop. The second loop can have a nowait clause only if the values it computes (c) are not read before they have been calculated (i.e., there is no dependency). In the original example code there is a nowait on the second loop, but an explicit barrier before the third loop. This is fine, since your professor was trying to show the use of an explicit barrier, though leaving off the nowait on the second loop would make the explicit barrier redundant (since there is an implicit barrier at the end of a loop).

On the other hand, the nowait on the second loop and the explicit barrier may not be needed at all. Prior to the OpenMP V3.0 specification, many people relied on an assumption that the specification did not actually guarantee. With the OpenMP V3.0 specification the following was added to section 2.5.1 Loop Construct, Table 2-1 schedule clause kind values, static (schedule):

A compliant implementation of static schedule must ensure that the same
assignment of logical iteration numbers to threads will be used in two loop
regions if the following conditions are satisfied: 1) both loop regions have the
same number of loop iterations, 2) both loop regions have the same value of
chunk_size specified, or both loop regions have no chunk_size specified, and 3)
both loop regions bind to the same parallel region. A data dependence between
the same logical iterations in two such loops is guaranteed to be satisfied
allowing safe use of the nowait clause (see Section A.9 on page 170 for
examples).

Now in your example, no schedule is shown on any of the loops, so this may or may not hold. The reason is that the default schedule is implementation defined, and while most implementations currently define the default schedule to be static, there is no guarantee of that. If your professor had put a schedule type of static without a chunk size on all three loops, then nowait could be used on the first and second loops, and no barrier (either implicit or explicit) would be needed between the second and third loops at all.
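
A sketch of that variant, assuming an OpenMP 3.0 or later implementation: the wrapper function add_and_sum_static and its signature are invented here for illustration, while the loop bodies are the ones from the question. All three loops use schedule(static) with no chunk size, have the same iteration count, and bind to the same parallel region, so iteration i should be executed by the same thread in each loop.

```c
#include <assert.h>

/* Hypothetical wrapper around the three loops from the question. */
double add_and_sum_static(int n, double *a, const double *b,
                          double *c, const double *d)
{
    assert(n >= 0);
    double sum = 0.0;

    #pragma omp parallel default(none) shared(n, a, b, c, d, sum)
    {
        /* Same schedule kind, no chunk_size, same trip count, same
         * parallel region: the iteration-to-thread mapping is identical. */
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; ++i)
            a[i] += b[i];

        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; ++i)
            c[i] += d[i];

        /* Each thread reads only the a[i] and c[i] it wrote itself. */
        #pragma omp for schedule(static) nowait reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += a[i] + c[i];
    } /* implicit barrier of the parallel region completes the reduction */

    return sum;
}
```

Note that sum is read only after the parallel region has ended, which is what keeps the nowait on the reduction loop safe.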

Now we get to the third loop and your question about nowait and reduction. As Michy pointed out, the OpenMP specification allows both (reduction and nowait) to be specified. However, it is not true that no synchronization is needed for the reduction to be complete. In the example, the implicit barrier (at the end of the third loop) can be removed with the nowait. This is because the reduction (sum) is not being used before the implicit barrier of the parallel region has been encountered.

If you look at the OpenMP V3.0 specification, section 2.9.3.6 reduction clause, you will find the following:

If nowait is not used, the reduction computation will be complete at the end of the
construct; however, if the reduction clause is used on a construct to which nowait is
also applied, accesses to the original list item will create a race and, thus, have
unspecified effect unless synchronization ensures that they occur after all threads have
executed all of their iterations or section constructs, and the reduction computation
has completed and stored the computed value of that list item. This can most simply be
ensured through a barrier synchronization.

This means that if you wanted to use the sum variable in the parallel region after the third loop, then you would need a barrier (either implicit or explicit) before you used it. As the example stands now, it is correct.
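
To make that concrete, here is a small sketch of the case where sum is used inside the parallel region. The function normalize_by_sum and the output array out are hypothetical names introduced for this example, and the code assumes the total is nonzero. The explicit barrier is what makes the later reads of sum safe.

```c
#include <assert.h>

/* Hypothetical helper: out[i] = (a[i] + c[i]) / (sum over all a[i] + c[i]).
 * Assumes the total is nonzero. */
void normalize_by_sum(int n, const double *a, const double *c, double *out)
{
    assert(n >= 0);
    double sum = 0.0;

    #pragma omp parallel default(none) shared(n, a, c, out, sum)
    {
        #pragma omp for nowait reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += a[i] + c[i];

        /* Without this barrier, the reads of sum below could race with
         * the reduction's final combine step. */
        #pragma omp barrier

        #pragma omp for
        for (int i = 0; i < n; ++i)
            out[i] = (a[i] + c[i]) / sum;
    }
}
```

In this particular layout the nowait plus the explicit barrier has the same effect as simply omitting nowait; the nowait form only pays off when independent work can be placed between the reduction loop and the barrier.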

做个少女永远怀春 2024-11-22 00:00:55

The OpenMP specification says:

The syntax of the loop construct is as follows:
#pragma omp for [clause[[,] clause] ... ] new-line
    for-loops
where clause is one of the following:
 ...
 reduction(operator: list)
 ...
 nowait

So a loop construct can take several clauses, which means reduction and nowait can both appear on it.

There is no need for explicit synchronization in the reduction clause: the additions to the sum variable are synchronized because of reduction(+: sum), and the previous barrier ensures that a and c have their final values by the time the reduction loop runs. The nowait means that a thread that finishes its work in the loop does not have to wait until all other threads finish the same loop.
