Why is OpenMP slow in this case?

Posted on 2024-10-25 14:17:21

I am trying to understand why OpenMP works the way it does in the following example.

#include <omp.h>
#include <iostream>
#include <vector>
#include <stdlib.h>

void AddVectors (std::vector< double >& v1,
                 std::vector< double >& v2) {

    size_t i;

#pragma omp parallel for private(i)
    for (i = 0; i < v1.size(); i++) v1[i] += v2[i];

}


int main (int argc, char** argv) {

    size_t N1 = atoi(argv[1]);

    std::vector< double > v1(N1,1);
    std::vector< double > v2(N1,2);

    for (size_t i = 0; i < N1; i++) AddVectors(v1,v2);

    return 0;

}

I first compiled the code above without enabling OpenMP (by omitting -fopenmp from the compiler flags). The execution time for N1 = 10000 was 0.1 s. Enabling OpenMP pushes the execution time beyond 1 minute; I stopped it before it finished (got tired of waiting...).

I am compiling the code as below:

g++ -std=c++0x -O3 -funroll-loops -march=core2 -fomit-frame-pointer -Wall -fno-strict-aliasing -o main.o -c main.cpp

g++ main.o -o main

Not all of these flags are necessary here, but I use them in the project I'm trying to parallelize, so I decided to leave them in. I also add -fopenmp to enable OpenMP at compile time.

Does anybody know what's going wrong? Thank you!

Comments (3)

七婞 2024-11-01 14:17:21

I tried the same example on Visual Studio 2008.
I made two modifications to your code example, and it runs roughly 3 times faster with OpenMP than without.

I can't confirm this on GCC, but the problem might be that AddVectors is called inside the main loop, so each call has to perform a "fork" operation (starting a parallel region), which takes a measurable amount of time. With N1 = 10000, that means 10000 "fork" operations.

Below is your own code snippet, modified only to make it work under Visual Studio;
I also added a print statement at the end to keep the compiler from removing all the code.

#include <omp.h>
#include <iostream>
#include <vector>
#include <stdlib.h>
#include <stdio.h>   // for printf

void AddVectors (std::vector< double >& v1,
                 std::vector< double >& v2) {

    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(v1.size()); i++) v1[i] += v2[i];

}


int main (int argc, char** argv) {

    size_t N1 = atoi(argv[1]);

    std::vector< double > v1(N1,1);
    std::vector< double > v2(N1,2);

    for (size_t i = 0; i < N1; i++) AddVectors(v1,v2);


    printf("%g\n",v1[0]);
    return 0;

}
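
If the repeated "fork" really is the cost, one way to test the idea is to open the parallel region once and keep only the work-sharing loop inside the repetition. The following is a minimal sketch under that assumption, not a confirmed fix for the GCC case; the helper name AddVectorsRepeated and its repeats parameter are hypothetical and do not appear in the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): performs `repeats`
// passes of v1 += v2 while entering the parallel region only once.
void AddVectorsRepeated(std::vector<double>& v1,
                        const std::vector<double>& v2,
                        int repeats) {
    assert(v1.size() == v2.size());
    const int n = static_cast<int>(v1.size());

    // One parallel region for all passes instead of one per call.
    #pragma omp parallel
    {
        // `r` is declared inside the region, so every thread has its own copy
        // and every thread reaches the work-sharing loop the same number of times.
        for (int r = 0; r < repeats; r++) {
            // The implicit barrier at the end of `omp for` keeps the passes ordered.
            #pragma omp for
            for (int i = 0; i < n; i++) v1[i] += v2[i];
        }
    }
}
```

Called as AddVectorsRepeated(v1, v2, N1), this would replace the loop in main. Without -fopenmp the pragmas are ignored and the function simply runs serially, which makes a before/after comparison straightforward.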

窝囊感情。 2024-11-01 14:17:21

Could g++ have optimized out the AddVectors calls entirely? Try returning the last element of v1 and storing it in a volatile variable.
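
A minimal sketch of that suggestion, assuming the loop body stays as in the question. The names g_sink, AddVectorsAndReturnLast and RunKept are hypothetical, introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sink: writes to a volatile object count as observable
// behaviour, so the compiler has to keep the value that is stored here.
volatile double g_sink = 0.0;

// Same loop as in the question, but the last element is returned to the caller.
double AddVectorsAndReturnLast(std::vector<double>& v1,
                               const std::vector<double>& v2) {
    assert(!v1.empty() && v1.size() == v2.size());
    const int n = static_cast<int>(v1.size());

    #pragma omp parallel for
    for (int i = 0; i < n; i++) v1[i] += v2[i];

    return v1.back();
}

// Repeats the addition and stores each result in the volatile sink.
void RunKept(std::vector<double>& v1,
             const std::vector<double>& v2,
             int repeats) {
    for (int r = 0; r < repeats; r++) g_sink = AddVectorsAndReturnLast(v1, v2);
}
```

If the serial timing changes noticeably with this version, the 0.1 s figure was probably measuring code that had been partly optimized away.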

故笙诉离歌 2024-11-01 14:17:21

The problem is with the array type you are using.

A vector is a container. It is a structure that stores several pieces of information, such as size, begin, and end, and has several built-in functions; operator[] is the one used to access the data. As a result, the cache line loaded for index "i" of the vector V holds the element V[i] along with some information that the code never uses.

By contrast, if you use classical arrays (dynamic or static), operator[] loads only the data elements. As a result, a cache line (usually 64 bytes long) holds 8 elements of this double array (size of double = 8 bytes).

See the difference between _mm_malloc and malloc for improving data alignment.
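
The cache-line arithmetic and the alignment point can be checked with a small sketch. It assumes a 64-byte cache line, which is common but not universal. The helper names are hypothetical, and the sketch only inspects addresses; it does not use _mm_malloc itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 64-byte cache lines, as stated in the answer above.
const std::size_t kCacheLineBytes = 64;

// Number of doubles that fit in one cache line (64 / 8 = 8 on typical platforms).
std::size_t DoublesPerCacheLine() {
    return kCacheLineBytes / sizeof(double);
}

// True if the address is a multiple of the assumed cache-line size.
bool IsCacheLineAligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLineBytes == 0;
}

// Index of the first element of `v` that starts exactly on a cache-line
// boundary, or v.size() if no element does.
std::size_t FirstAlignedIndex(const std::vector<double>& v) {
    assert(!v.empty());
    for (std::size_t i = 0; i < v.size(); i++) {
        if (IsCacheLineAligned(v.data() + i)) return i;
    }
    return v.size();
}
```

A result of 0 from FirstAlignedIndex means the buffer already starts on a line boundary. A non-zero result means the first line is shared with whatever precedes the buffer, which is the situation an aligned allocator is meant to avoid.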

@Mr Fooz
I am not sure about that. Let's compare the performance results for both cases:

4 Threads on i7 processor

Array Time Taken.: 0.122007 | Repeat: 4 | MFlops: 327.85

Vector Time Taken: 0.101006 | Repeat: 2 | MFlops: 188.669

I force the runtime to be more than 0.1 s, so the code repeats itself. The main loop:

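const int N = 10000000;
// A and B hold N doubles each (plain arrays in one run, vectors in the other).
// timing() and dummy() are helpers defined elsewhere in the benchmark.
timing(&wcs);                        // wall-clock start
for(; runtime < 0.1; repeat*=2)      // double the repeat count until the run lasts 0.1 s
{
    for(int r = 0; r < repeat; r++)
    {
        #pragma omp parallel for
        for(int i = 0; i < N; i++)
        {
            A[i] += B[i];
        }
        if(A[0]==0) dummy(A[0]);     // keeps the compiler from discarding the loop
    }
    timing(&wce);                    // wall-clock end
    runtime = wce-wcs;
}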

MFLops: ((N*repeat)/runtime)/1000000
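
The fragment above depends on helpers that are not shown (timing, dummy, and the declarations of A, B, repeat and runtime). A self-contained sketch of the same idea follows, with std::chrono standing in for timing() and a volatile sink standing in for dummy(); BenchResult and BenchAdd are hypothetical names. Unlike the fragment, it restarts the clock for every round, so the reported time and repeat count refer to the same round. Figures from it are therefore not directly comparable to the ones quoted above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical result record for one benchmark run.
struct BenchResult {
    double seconds;  // duration of the last (long enough) round
    int repeat;      // passes performed in that round
    double mflops;   // ((N * repeat) / seconds) / 1e6, as in the formula above
};

// Doubles the number of passes until one round takes at least min_seconds.
BenchResult BenchAdd(std::vector<double>& a,
                     const std::vector<double>& b,
                     double min_seconds) {
    assert(!a.empty() && a.size() == b.size() && min_seconds > 0.0);
    const int n = static_cast<int>(a.size());
    volatile double sink = 0.0;  // prevents the additions from being discarded
    int repeat = 1;
    double seconds = 0.0;

    for (;;) {
        const auto start = std::chrono::steady_clock::now();
        for (int r = 0; r < repeat; r++) {
            #pragma omp parallel for
            for (int i = 0; i < n; i++) a[i] += b[i];
            sink = a[0];
        }
        const auto stop = std::chrono::steady_clock::now();
        seconds = std::chrono::duration<double>(stop - start).count();
        if (seconds >= min_seconds) break;
        repeat *= 2;  // not long enough yet: try twice as many passes
    }

    (void)sink;
    BenchResult result = {seconds, repeat,
                          (static_cast<double>(n) * repeat / seconds) / 1e6};
    return result;
}
```

Running BenchAdd once on plain-array-backed data and once on std::vector data, with the same thread count, gives a like-for-like comparison of the two cases discussed here.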
