OpenMP 程序性能低下
我试图从 此处 理解 openmp 代码。您可以看到下面的代码。
为了测量加速比,串行版本和omp版本之间的差异,我使用time.h,你觉得这种方法正确吗?
该程序在 4 核机器上运行。我指定
export OMP_NUM_THREADS="4"
但看不到明显的加速,通常我得到 1.2 - 1.7。我在并行化过程中遇到了哪些问题?我可以使用哪种调试/性能工具来查看性能损失?
代码(对于编译,我使用xlc_r -qsmp=omp omp_workshare1.c -o omp_workshare1.exe
)
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define CHUNKSIZE 1000000
#define N 100000000
int main (int argc, char *argv[])
{
int nthreads, tid, i, chunk;
float a[N], b[N], c[N];
unsigned long elapsed;
unsigned long elapsed_serial;
unsigned long elapsed_omp;
struct timeval start;
struct timeval stop;
chunk = CHUNKSIZE;
// ================= SERIAL start =======================
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
gettimeofday(&start,NULL);
for (i=0; i<N; i++)
{
c[i] = a[i] + b[i];
//printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
gettimeofday(&stop,NULL);
elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
elapsed += stop.tv_usec - start.tv_usec;
elapsed_serial = elapsed ;
printf (" \n Time SEQ= %lu microsecs\n", elapsed_serial);
// ================= SERIAL end =======================
// ================= OMP start =======================
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
gettimeofday(&start,NULL);
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
{
tid = omp_get_thread_num();
if (tid == 0)
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
//printf("Thread %d starting...\n",tid);
#pragma omp for schedule(static,chunk)
for (i=0; i<N; i++)
{
c[i] = a[i] + b[i];
//printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
} /* end of parallel section */
gettimeofday(&stop,NULL);
elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
elapsed += stop.tv_usec - start.tv_usec;
elapsed_omp = elapsed ;
printf (" \n Time OMP= %lu microsecs\n", elapsed_omp);
// ================= OMP end =======================
printf (" \n speedup= %f \n\n", ((float) elapsed_serial) / ((float) elapsed_omp)) ;
}
I am trying to understand an openmp code from here. You can see the code below.
In order to measure the speedup, difference between the serial and omp version, I use time.h, do you find right this approach?
The program runs on a 4 core machine. I specify
export OMP_NUM_THREADS="4"
but can not see substantially speedup, usually I get 1.2 - 1.7. Which problems am I facing in this parallelization?Which debug/performace tool could I use to see the loss of performace?
code (for compilation I use xlc_r -qsmp=omp omp_workshare1.c -o omp_workshare1.exe
)
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define CHUNKSIZE 1000000
#define N 100000000
int main (int argc, char *argv[])
{
int nthreads, tid, i, chunk;
float a[N], b[N], c[N];
unsigned long elapsed;
unsigned long elapsed_serial;
unsigned long elapsed_omp;
struct timeval start;
struct timeval stop;
chunk = CHUNKSIZE;
// ================= SERIAL start =======================
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
gettimeofday(&start,NULL);
for (i=0; i<N; i++)
{
c[i] = a[i] + b[i];
//printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
gettimeofday(&stop,NULL);
elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
elapsed += stop.tv_usec - start.tv_usec;
elapsed_serial = elapsed ;
printf (" \n Time SEQ= %lu microsecs\n", elapsed_serial);
// ================= SERIAL end =======================
// ================= OMP start =======================
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
gettimeofday(&start,NULL);
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
{
tid = omp_get_thread_num();
if (tid == 0)
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
//printf("Thread %d starting...\n",tid);
#pragma omp for schedule(static,chunk)
for (i=0; i<N; i++)
{
c[i] = a[i] + b[i];
//printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
} /* end of parallel section */
gettimeofday(&stop,NULL);
elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
elapsed += stop.tv_usec - start.tv_usec;
elapsed_omp = elapsed ;
printf (" \n Time OMP= %lu microsecs\n", elapsed_omp);
// ================= OMP end =======================
printf (" \n speedup= %f \n\n", ((float) elapsed_serial) / ((float) elapsed_omp)) ;
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
上面的代码并没有什么问题,但是你的加速将受到主循环 c=a+b 几乎没有工作的限制——计算所需的时间(单个加法)将由内存访问时间(2 次加载和 1 次存储)主导,并且随着更多线程作用于阵列,内存带宽的争用也会增加。
我们可以通过使循环内的工作更加计算密集来测试这一点:
然后我们得到的
结果显然更接近预期的 4 倍加速。
更新:哦,对于其他问题 - gettimeofday() 可能适合计时,并且在您使用 xlc 的系统上 - 这是 AIX 吗?在这种情况下,peekperf 是一个很好的整体性能工具,硬件性能监视器将使您能够了解内存访问时间。在 x86 平台上,用于线程代码性能监控的免费工具包括用于缓存性能调试的 cachegrind/valgrind(不是这里的问题)、用于一般 OpenMP 问题的 scalasca,以及 OpenSpeedShop 也非常有用。
There's nothing really wrong with the code as above, but your speedup is going to be limited by the fact that the main loop, c=a+b, has very little work -- the time required to do the computation (a single addition) is going to be dominated by memory access time (2 loads and one store), and there's more contention for memory bandwidth with more threads acting on the array.
We can test this by making the work inside the loop more compute-intensive:
And then we get
which is obviously a lot closer to the 4x speedup one would expect.
Update: Oh, and to the other questions -- gettimeofday() is probably fine for timing, and on a system where you're using xlc - is this AIX? In that case, peekperf is a good overall performance tool, and the hardware performance monitors will give you access to to memory access times. On x86 platforms, free tools for performance monitoring of threaded code include cachegrind/valgrind for cache performance debugging (not the problem here), scalasca for general OpenMP issues, and OpenSpeedShop is pretty useful, too.