OpenMP 并行循环比常规循环慢得多

发布于 2025-01-19 18:31:07 字数 752 浏览 1 评论 0原文

整个程序已经缩小了一个简单的测试:

    const int loops = 1e10;
    int j[4] = { 1, 2, 3, 4 };
    time_t time = std::time(nullptr);
    for (int i = 0; i < loops; i++) j[i % 4] += 2;
    std::cout << std::time(nullptr) - time << std::endl;

    int k[4] = { 1, 2, 3, 4 };
    omp_set_num_threads(4);
    time = std::time(nullptr);
#pragma omp parallel for
    for (int i = 0; i < loops; i++) k[omp_get_thread_num()] += 2;
    std::cout << std::time(nullptr) - time << std::endl;

在第一种情况下,通过循环运行大约需要3秒钟,在第二种情况下,结果是不一致的,可能是4-9秒。这两个循环启用了一些优化的速度(例如,偏爱速度和整个程序优化),但是第二个循环仍然明显较慢。我尝试在循环的末尾添加障碍,并明确将数组指定为共享,但这无济于事。我设法使并行循环运行速度更快的唯一情况是使循环空。有什么问题?

Windows 10 X64,CPU Intel Core i5 10300h(4核)

The whole program has been shrunk to a simple test:

    const int loops = 1e10;
    int j[4] = { 1, 2, 3, 4 };
    time_t time = std::time(nullptr);
    for (int i = 0; i < loops; i++) j[i % 4] += 2;
    std::cout << std::time(nullptr) - time << std::endl;

    int k[4] = { 1, 2, 3, 4 };
    omp_set_num_threads(4);
    time = std::time(nullptr);
#pragma omp parallel for
    for (int i = 0; i < loops; i++) k[omp_get_thread_num()] += 2;
    std::cout << std::time(nullptr) - time << std::endl;

In the first case it takes about 3 seconds to run through the loop, in the second case the result is inconsistent and may be 4 - 9 seconds. Both of the loops run faster with some optimization enabled (like favouring speed and whole program optimization), but the second loop is still significantly slower. I tried adding barrier at the end of the loop and explicitly specifying the array as shared, but that didn't help. The only case when I managed to make the parallel loop run faster is by making the loops empty. What may be the problem?

Windows 10 x64, CPU Intel Core i5 10300H (4 cores)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

一袭白衣梦中忆 2025-01-26 18:31:08

正如各种评论中已经指出的那样,问题的关键是 false共享。确实,您的示例是可以实验的典型情况。但是,您的代码中也有很多问题,例如:

  • 您可能会在loops变量以及所有jk 表;
  • 您的计时器并不是有史以来真正的最佳选择(诚然,在这种情况下,这是我的典范);
  • 您不利用您计算的值,该值允许编译器完全忽略各种计算;
  • 您的两个循环不是等效的,不会给出相同的结果;为了使其正确,我回到了您的原始i%4公式,并添加了schedule(static,1)子句。这不是一种正确的方法,但是只能在不使用正确的降低子句的情况下获得预期的结果。

然后,我重写了您的示例,并以我认为是错误共享问题的更好解决方案:使用降低条款。

#include <iostream>
#include <omp.h>

int main() {
    const long long loops = 1e10;
    long long j[4] = { 1, 2, 3, 4 };
    double time = omp_get_wtime();
    for ( long long i = 0; i < loops; i++ ) {
         j[i % 4] += 2;
    }
    std::cout << "sequential: " << omp_get_wtime() - time << std::endl;

    time = omp_get_wtime();
    long long k[4] = { 1, 2, 3, 4 };
    #pragma omp parallel for num_threads( 4 ) schedule( static, 1 )
    for ( long long i = 0; i < loops; i++ ) {
        k[i%4] += 2;
    }
    std::cout << "false sharing: " << omp_get_wtime() - time << std::endl;

    time = omp_get_wtime();
    long long l[4] = { 1, 2, 3, 4 };
    #pragma omp parallel for num_threads( 4 ) reduction( +: l[0:4] )
    for ( long long i = 0; i < loops; i++ ) {
        l[i%4] += 2;
    }
    std::cout << "reduction: " << omp_get_wtime() - time << std::endl;

    bool a = j[0]==k[0] && j[1]==k[1] && j[2]==k[2] && j[3]==k[3];
    bool b = j[0]==l[0] && j[1]==l[1] && j[2]==l[2] && j[3]==l[3];
    std::cout << "sanity check: " << a << " " << b << std::endl;

    return 0;
}

在我的笔记本电脑上进行编译和运行,而没有优化:

$ g++ -O0 -fopenmp false.cc 
$ ./a.out 
sequential: 15.5384
false sharing: 47.1417
reduction: 4.7565
sanity check: 1 1

关于改进的减少条款带来的说法。
现在,启用编译器的优化给出了更加缓解的图片:

$ g++ -O3 -fopenmp false.cc 
$ ./a.out 
sequential: 4.8414
false sharing: 4.10714
reduction: 2.10953
sanity check: 1 1

如果有的话,这表明编译器如今非常擅长避免当今的大多数虚假共享。实际上,使用您的初始(错误)k [op_get_thread_num()],与降低条款没有时间差异,表明编译器能够避免问题。

As already pointed out in the various comments, the crux of your problem is false sharing. Indeed, your example is the typical case where one can experiment this. However, there are also quite a few issues in your code, such as:

  • You will likely see overflows, both in your loops variable and in all of your j and k tables;
  • Your timer isn't really the best choice ever (admittedly, that is a bit pedantic from my part in this case);
  • You do not make use of the values you computed, which allows the compiler to completely ignore the various computations;
  • Your two loops are not equivalent and won't give the same results; In order to get it right, I went back to your original i%4 formula and added a schedule( static, 1) clause. This is not a proper way of doing it, but it was only to get the expected results without using the correct reduction clause.

Then I rewrote your example and also augmented it with what I believe is a better solution to the false sharing issue: using a reduction clause.

#include <iostream>
#include <omp.h>

int main() {
    const long long loops = 1e10;
    long long j[4] = { 1, 2, 3, 4 };
    double time = omp_get_wtime();
    for ( long long i = 0; i < loops; i++ ) {
         j[i % 4] += 2;
    }
    std::cout << "sequential: " << omp_get_wtime() - time << std::endl;

    time = omp_get_wtime();
    long long k[4] = { 1, 2, 3, 4 };
    #pragma omp parallel for num_threads( 4 ) schedule( static, 1 )
    for ( long long i = 0; i < loops; i++ ) {
        k[i%4] += 2;
    }
    std::cout << "false sharing: " << omp_get_wtime() - time << std::endl;

    time = omp_get_wtime();
    long long l[4] = { 1, 2, 3, 4 };
    #pragma omp parallel for num_threads( 4 ) reduction( +: l[0:4] )
    for ( long long i = 0; i < loops; i++ ) {
        l[i%4] += 2;
    }
    std::cout << "reduction: " << omp_get_wtime() - time << std::endl;

    bool a = j[0]==k[0] && j[1]==k[1] && j[2]==k[2] && j[3]==k[3];
    bool b = j[0]==l[0] && j[1]==l[1] && j[2]==l[2] && j[3]==l[3];
    std::cout << "sanity check: " << a << " " << b << std::endl;

    return 0;
}

Compiling and running without optimizations gives on my laptop:

$ g++ -O0 -fopenmp false.cc 
$ ./a.out 
sequential: 15.5384
false sharing: 47.1417
reduction: 4.7565
sanity check: 1 1

Which speaks for itself in regard to the improvement the reduction clause brings.
Now, enabling optimizations from the compiler gives a more mitigated picture:

$ g++ -O3 -fopenmp false.cc 
$ ./a.out 
sequential: 4.8414
false sharing: 4.10714
reduction: 2.10953
sanity check: 1 1

If anything, that shows that compilers are quite good at avoiding most of false sharing nowadays. Indeed, with your initial (erroneous) k[omp_get_thread_num()], there was no time difference with and without the reduction clause, showing that the compiler was able to avoid the issue.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文