OpenMP and shared structures and pointers
I have a function that is passed two structures by reference. These structures are composed of dynamically allocated arrays. When I try to add OpenMP, I get a slowdown instead of a speedup. I suspect this is caused by sharing issues. Here is some of the code (C):
void leap(MHD *mhd,GRID *grid,short int gchk)
{
    /*-- V A R I A B L E S --*/
    // Indexes
    int i,j,k,tid;
    double rhoinv[grid->nx][grid->ny][grid->nz];
    double rhoiinv[grid->nx][grid->ny][grid->nz];
    double rhoeinv[grid->nx][grid->ny][grid->nz];
    double rhoninv[grid->nx][grid->ny][grid->nz]; // Rho Inversion
    #pragma omp parallel shared(mhd->rho,mhd->rhoi,mhd->rhoe,mhd->rhon,grid,rhoinv,rhoiinv,rhoeinv,rhoninv) \
                         private(i,j,k,tid,stime)
    {
        tid=omp_get_thread_num();
        printf("----- Thread %d Checking in!\n",tid);
        #pragma omp barrier
        if (tid == 0)
        {
            stime=clock();
            printf("-----1) Calculating leap helpers");
        }
        #pragma omp for
        for(i=0;i<grid->nx;i++)
        {
            for(j=0;j<grid->ny;j++)
            {
                for(k=0;k<grid->nz;k++)
                {
                    // rho's
                    rhoinv[i][j][k]=1./mhd->rho[i][j][k];
                    rhoiinv[i][j][k]=1./mhd->rhoi[i][j][k];
                    rhoeinv[i][j][k]=1./mhd->rhoe[i][j][k];
                    rhoninv[i][j][k]=1./mhd->rhon[i][j][k];
                }
            }
        }
        if (tid == 0)
        {
            printf("........%04.2f [s] -----\n",(clock()-stime)/CLOCKS_PER_SEC);
            stime=clock();
        }
        #pragma omp barrier
    }/*-- End Parallel Region --*/
}
I've tried default(shared) and shared(mhd), but neither shows any sign of improvement. The arrays are allocated like this:
mhd->rho=(double ***)newarray(nx,ny,nz,sizeof(double));
Could it be that, by declaring the structure or a pointer to an element of the structure as shared, I'm not actually sharing the memory, just the pointers to it? In this example nx=389, ny=7 and nz=739. Execution time for this section is 0.23 [s] in serial and 0.79 [s] with 8 threads.
Comments (3)
My issue boiled down to a really simple mistake: clock(). While I did protect my timing code by having only one specific thread measure the time, I forgot one important thing about clock(): it does not return wall-clock time. It returns the total processor time used by the program (summed over the active threads). What I needed to call was omp_get_wtime(). With that change I suddenly see a speedup in many sections of my code. For the record I've modified my code to include
And my timing algorithm is now
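The snippets that followed these two sentences are not preserved here. As a minimal sketch, and not the author's actual code, wall-clock timing around a parallel loop might look like the following in C. The names `wall_seconds`, `invert`, and `time_invert` are hypothetical, and the fallback to `timespec_get` when OpenMP is not enabled is an assumption added so the snippet builds either way.

```c
#include <stdio.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: elapsed wall-clock seconds since some fixed point.
   Uses omp_get_wtime() when compiled with OpenMP enabled; otherwise falls
   back to C11 timespec_get() (an assumption, not from the original post). */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
#endif
}

/* Hypothetical workload: out[i] = 1/in[i], split across threads. */
static void invert(const double *in, double *out, int n)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        out[i] = 1.0 / in[i];
}

/* Times one call to invert() from outside the parallel region and
   returns the elapsed wall-clock time in seconds. */
static double time_invert(const double *in, double *out, int n)
{
    double t0 = wall_seconds();
    invert(in, out, n);
    return wall_seconds() - t0;
}
```

Calling `time_invert` from `main` and printing the result with `printf("%.4f [s]\n", dt)` gives a number that can be compared directly between serial and threaded runs, which a `clock()`-based measurement cannot do once several threads are busy.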
An important point here could be the upper bounds of your loops. Since you use
grid->nz
etc., the compiler may not be able to tell whether they change between iterations. Load these values into local variables and use those in the loop conditions.
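A minimal sketch of that suggestion in C follows. The `GRID` struct here is a hypothetical stand-in holding only the three extents, `invert_grid` is an illustrative name, and the arrays are assumed to be flattened into contiguous storage, which differs from the `double ***` layout in the question.

```c
#include <stddef.h>

/* Hypothetical stand-in for the GRID structure from the question;
   only the three extents are assumed to exist. */
typedef struct { int nx, ny, nz; } GRID;

/* Computes out = 1/in over a flattened nx*ny*nz array. The extents are
   copied into const locals first, so the loop bounds are plainly
   invariant for the compiler and the OpenMP runtime. */
static void invert_grid(const GRID *grid, const double *in, double *out)
{
    const int nx = grid->nx;
    const int ny = grid->ny;
    const int nz = grid->nz;
    int i, j, k;

    /* i is private automatically as the parallel loop index;
       j and k must be listed because they are declared outside. */
    #pragma omp parallel for private(j, k)
    for (i = 0; i < nx; i++) {
        for (j = 0; j < ny; j++) {
            for (k = 0; k < nz; k++) {
                size_t idx = ((size_t)i * ny + j) * nz + k;
                out[idx] = 1.0 / in[idx];
            }
        }
    }
}
```

Whether this changes the timing depends on the compiler and optimization level; it is cheap to try and keeps the loop headers easier to read.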
Well, you are also using doubles and division. Can you make the division into multiplication?
The floating point unit is shared among the cores and divisions do not have a deterministic number of cycles till completion (as opposed to multiplication). So you end up serializing for accessing the fp unit.
I'm sure that if you use integral types or multiplication, you'll see a speedup.