OpenMP: parallelizing a loop across N threads

Published 2025-01-27 11:43:08


I am trying to parallelize a loop of 50 million iterations with several thread counts - first 1, then 4, 8 and 16. Below is the code implementing this functionality.

#include <iostream>
#include <omp.h>

using namespace std;

void someFoo();

int main() {
    someFoo();
}

void someFoo() {
    long sum = 0;
    int numOfThreads[] = {1, 4, 8, 16};
    
    for(int j = 0; j < sizeof(numOfThreads) / sizeof(int); j++) {
        omp_set_num_threads(numOfThreads[j]);
        start = omp_get_wtime();
        #pragma omp parallel for
        for(int i = 0; i<50000000; i++) {
            sum += i * 10;
        }
        #pragma omp end parallel
        
        end = omp_get_wtime();
        
        cout << "Result: " << sum << ". Spent time: " << (end - start) << "\n";
    }
}

It is expected that the program will run faster with 4 threads than with 1, faster with 8 threads than with 4, and faster with 16 threads than with 8, but in practice this is not the case - the run times vary chaotically and there is almost no difference between them. Also, the task manager does not show the program running in parallel. My computer has 4 cores and 8 logical processors.

Please tell me where I made a mistake and how to properly parallelize the loop in N threads.


Comments (2)

自此以后,行同陌路 2025-02-03 11:43:08


There is a race condition in your code because sum is read and written by multiple threads at the same time, which causes wrong results. You can fix this using a reduction, with the directive #pragma omp parallel for reduction(+:sum). Note that OpenMP does not check whether your loop can safely be parallelized - that is your responsibility.

Additionally, the parallel computation might be slower than the sequential one, since a clever compiler can see that sum = 50000000*(50000000-1)/2*10 = 12499999750000000 and fold the whole loop into that constant (AFAIK, Clang does that). As a result, the benchmark is certainly flawed. Note also that this value is bigger than what a 32-bit long can contain, so on platforms where long is 32 bits (e.g. Windows) there is an overflow in your code.

Moreover, AFAIK, there is no such directive as #pragma omp end parallel.

Finally, note that you can control the number of threads using the OMP_NUM_THREADS environment variable, which is generally more convenient than setting it inside the application (hardwiring a given number of threads in the application code is generally not a good idea, even for benchmarks).

素年丶 2025-02-03 11:43:08


Please tell me where I made a mistake and how to properly parallelize
the loop in N threads.

First, you need to fix some of the compiler issues in your code example, like removing pragmas such as #pragma omp end parallel, declaring the variables correctly, and so on. Second, you need to fix the race condition in the update of the variable sum: that variable is shared among threads and updated concurrently. The easiest way is to use OpenMP's reduction clause; your code would look like the following:

#include <stdio.h>
#include <omp.h>

void someFoo();

int main() {
    someFoo();
}

void someFoo() {
    int numOfThreads[] = {1, 4, 8, 16};

    for(int j = 0; j < sizeof(numOfThreads) / sizeof(int); j++) {
        omp_set_num_threads(numOfThreads[j]);
        double start = omp_get_wtime();
        double sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for(int i = 0; i < 50000000; i++) {
            sum += i * 10;
        }
        double end = omp_get_wtime();
        // sum is a double, so it needs a floating-point format specifier
        printf("Result: '%.0f'. Spent time: '%f'\n", sum, (end - start));
    }
}

With that, you should get some speedup when running on multiple cores.

NOTE: To solve the overflow mentioned first by @Jérôme Richard, I changed the 'sum' variable from long to a double.
