Poor performance of an OpenMP program

Posted on 2024-10-08 17:24:18

I am trying to understand an OpenMP code from here. You can see the code below.

To measure the speedup of the OpenMP version over the serial version, I use gettimeofday() from sys/time.h. Is this approach correct?
The program runs on a 4-core machine. I set export OMP_NUM_THREADS="4" but do not see a substantial speedup; I usually get 1.2 to 1.7. What problems am I facing in this parallelization?
Which debugging/performance tool could I use to see where the performance is lost?

Code (to compile, I use xlc_r -qsmp=omp omp_workshare1.c -o omp_workshare1.exe):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define CHUNKSIZE   1000000
#define N       100000000

int main (int argc, char *argv[]) 
{
    int nthreads, tid, i, chunk;
    float a[N], b[N], c[N];
    unsigned long elapsed;
    unsigned long elapsed_serial;
    unsigned long elapsed_omp;
    struct timeval start;
    struct timeval stop;


    chunk = CHUNKSIZE;

    // =================    SERIAL     start =======================
    /* Some initializations */
    for (i=0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    gettimeofday(&start,NULL); 
    for (i=0; i<N; i++)
    {
        c[i] = a[i] + b[i];
        //printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
    }
    gettimeofday(&stop,NULL);
    elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
    elapsed += stop.tv_usec - start.tv_usec;
    elapsed_serial = elapsed ;
    printf ("   \n Time SEQ= %lu microsecs\n", elapsed_serial);
    // =================    SERIAL     end =======================


    // =================    OMP    start =======================
    /* Some initializations */
    for (i=0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    gettimeofday(&start,NULL); 
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        //printf("Thread %d starting...\n",tid);

#pragma omp for schedule(static,chunk)
        for (i=0; i<N; i++)
        {
            c[i] = a[i] + b[i];
            //printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
        }

    }  /* end of parallel section */
    gettimeofday(&stop,NULL);
    elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
    elapsed += stop.tv_usec - start.tv_usec;
    elapsed_omp = elapsed ;
    printf ("   \n Time OMP= %lu microsecs\n", elapsed_omp);
    // =================    OMP    end =======================
    printf ("   \n speedup= %f \n\n", ((float) elapsed_serial) / ((float) elapsed_omp)) ;

}


从来不烧饼 2024-10-15 17:24:18

There's nothing really wrong with the code above, but your speedup is going to be limited by the fact that the main loop, c=a+b, does very little work per element: the time required for the computation (a single addition) is dominated by memory access time (two loads and one store), and contention for memory bandwidth grows as more threads work on the arrays.
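As a rough check of the memory-bound explanation, one could convert a measured loop time into an effective memory bandwidth and compare it with what the machine can sustain. The sketch below assumes the loop moves exactly three float streams (two loads, one store) and ignores cache effects and any write-allocate traffic. The function name effective_bandwidth_gbs and the 400000-microsecond figure in the note after it are illustrative, not measurements from this thread.

```c
#include <stddef.h>

/* Effective bandwidth in GB/s for a loop that streams `streams` float arrays
 * of `n` elements each in `elapsed_us` microseconds.
 * Assumption: every element is moved to or from memory exactly once. */
double effective_bandwidth_gbs(size_t n, int streams, unsigned long elapsed_us)
{
    double bytes   = (double)n * (double)streams * (double)sizeof(float);
    double seconds = (double)elapsed_us * 1e-6;
    return bytes / seconds / 1e9;
}
```

For c[i] = a[i] + b[i] with N = 100000000 and 4-byte floats, streams is 3, so the loop moves about 1.2 GB. A hypothetical serial time of 400000 microseconds would correspond to roughly 3 GB/s. If the serial figure is already a large fraction of the bandwidth the memory system can deliver, adding threads cannot speed the loop up by much.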

We can test this by making the work inside the loop more compute-intensive:

c[i] = exp(sin(a[i])) + exp(cos(b[i]));

And then we get

$ ./apb

 Time SEQ= 17678571 microsecs
Number of threads = 4

 Time OMP= 4703485 microsecs

 speedup= 3.758611

which is obviously a lot closer to the 4x speedup one would expect.
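To put the two results side by side, speedup can be turned into parallel efficiency (speedup divided by thread count). This is a small sketch using the timings printed above; the helper names speedup and efficiency are my own.

```c
/* Speedup of the parallel run relative to the serial run. */
double speedup(unsigned long serial_us, unsigned long parallel_us)
{
    return (double)serial_us / (double)parallel_us;
}

/* Parallel efficiency: fraction of the ideal nthreads-fold speedup achieved. */
double efficiency(double speedup_value, int nthreads)
{
    return speedup_value / (double)nthreads;
}
```

With the timings above, speedup(17678571, 4703485) is about 3.76, so the efficiency on 4 threads is about 0.94. The 1.2 to 1.7 speedup reported for the plain addition loop corresponds to an efficiency of only about 0.30 to 0.43.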

Update: Oh, and to the other questions: gettimeofday() is probably fine for timing. As for tools, you're using xlc, so is this AIX? In that case, peekperf is a good overall performance tool, and the hardware performance monitors will give you access to memory access times. On x86 platforms, free tools for performance monitoring of threaded code include cachegrind/valgrind for cache performance debugging (not the problem here), scalasca for general OpenMP issues, and OpenSpeedShop, which is pretty useful too.
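For the timing itself, the gettimeofday() arithmetic from the question can be wrapped in a small helper so that both measurements go through the same code. This is a sketch: elapsed_usec and best_of are hypothetical helper names, and taking the fastest of several repeats is my own suggestion to reduce run-to-run noise, not something the code in the question does.

```c
#include <stddef.h>
#include <sys/time.h>

/* Microseconds between two gettimeofday() samples.
 * Signed 64-bit arithmetic keeps the intermediate tv_usec difference,
 * which can be negative, easy to reason about. */
long long elapsed_usec(const struct timeval *start, const struct timeval *stop)
{
    long long sec  = (long long)stop->tv_sec  - (long long)start->tv_sec;
    long long usec = (long long)stop->tv_usec - (long long)start->tv_usec;
    return sec * 1000000LL + usec;
}

/* Runs fn(arg) `repeats` times and returns the fastest run in microseconds
 * (hypothetical helper; returns -1 if repeats < 1). */
long long best_of(void (*fn)(void *), void *arg, int repeats)
{
    long long best = -1;
    for (int r = 0; r < repeats; r++) {
        struct timeval start, stop;
        gettimeofday(&start, NULL);
        fn(arg);
        gettimeofday(&stop, NULL);
        long long t = elapsed_usec(&start, &stop);
        if (best < 0 || t < best)
            best = t;
    }
    return best;
}
```

The serial loop and the OpenMP region could each be placed in a function with the void (*)(void *) signature and passed to best_of, so that the two numbers being divided are measured in exactly the same way.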
