Problem with nested loops and OpenMP

Posted on 2024-10-12 20:58:15


I am having trouble applying OpenMP to a nested loop like this:

    #pragma omp parallel shared(S2,nthreads,chunk) private(a,b,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("\nNumber of threads = %d\n", nthreads);
        }
        #pragma omp for schedule(dynamic,chunk)
        for (a = 0; a < NREC; a++) {
            for (b = 0; b < NLIG; b++) {
                S2 = S2 + cos(1 + sin(atan(sin(sqrt(a*2 + b*5) + cos(a) + sqrt(b)))));
            }
        } /* end for a */
    } /* end of parallel section */

When I compare the serial version with the OpenMP version, the latter gives weird results. Even when I remove the #pragma omp for, the OpenMP results are not correct. Do you know why, or can you point me to a good tutorial that is explicit about double loops and OpenMP?


Comments (2)

我的奇迹 2024-10-19 20:58:15


This is a classic example of a race condition. Each of your OpenMP threads is accessing and updating a shared value at the same time, and there's no guarantee that some of the updates won't get lost (at best) or that the resulting answer won't be gibberish (at worst).

The thing with race conditions is that they depend sensitively on timing; in a smaller case (e.g., with smaller NREC and NLIG) you might sometimes get away with it, but in a larger case it will eventually always show up.

The reason you get wrong answers even without the #pragma omp for is that as soon as you enter the parallel region, all of your OpenMP threads start; unless you use something like an omp for (a so-called worksharing construct) to split up the work, each thread will do everything in the parallel section - so all the threads will be doing the same entire sum, all updating S2 simultaneously.

You have to be careful with OpenMP threads updating shared variables. OpenMP has atomic operations that allow you to safely modify a shared variable. An example follows (unfortunately, your example is so sensitive to summation order that it's hard to see what's going on, so I've changed your sum somewhat). In mysumallatomic, each thread updates S2 as before, but this time it's done safely:

#include <omp.h>
#include <math.h>
#include <stdio.h>

double mysumorig() {

    double S2 = 0;
    int a, b;
    for(a=0;a<128;a++){
        for(b=0;b<128;b++){
            S2=S2+a*b;
        }
    }

    return S2;
}


double mysumallatomic() {

    double S2 = 0.;
#pragma omp parallel for shared(S2)
    for(int a=0; a<128; a++){
        for(int b=0; b<128;b++){
            double myterm = (double)a*b;
            #pragma omp atomic
            S2 += myterm;
        }
    }

    return S2;
}


double mysumonceatomic() {

    double S2 = 0.;
#pragma omp parallel shared(S2)
    {
        double mysum = 0.;
        #pragma omp for
        for(int a=0; a<128; a++){
            for(int b=0; b<128;b++){
                mysum += (double)a*b;
            }
        }
        #pragma omp atomic
        S2 += mysum;
    }
    return S2;
}

int main() {
    printf("(Serial)      S2 = %f\n", mysumorig());
    printf("(All Atomic)  S2 = %f\n", mysumallatomic());
    printf("(Atomic Once) S2 = %f\n", mysumonceatomic());
    return 0;
}

However, that atomic operation really hurts parallel performance (after all, the whole point is to prevent parallel operation around the variable S2!), so a better approach is to accumulate into a per-thread sum and do the atomic operation only once, after the loops, rather than 128*128 times; that's the mysumonceatomic() routine, which incurs the synchronization overhead only once per thread rather than 16k times per thread.

But this is such a common operation that there's no need to implement it yourself. One can use OpenMP's built-in functionality for reduction operations (a reduction is an operation, like calculating the sum of a list or finding its min or max, that can be done one element at a time by looking only at the result so far and the next element), as suggested by @ejd. OpenMP's reduction works and is faster (its optimized implementation is much faster than what you can build yourself out of other OpenMP operations).

As you can see, either approach works:

$ ./foo
(Serial)      S2 = 66064384.000000
(All Atomic)  S2 = 66064384.000000
(Atomic Once) S2 = 66064384.000000
苍暮颜 2024-10-19 20:58:15


The problem isn't with the double loops but with the variable S2. Try putting a reduction clause on your for directive:

#pragma omp for schedule(dynamic,chunk) reduction(+:S2)
