OpenMP C++: parallel performance better on a dual-core laptop than on an eight-core cluster
First of all, OpenMP obviously only runs on one of the motherboards in the cluster; in this case each motherboard has two quad-core Xeon E5405s at 2 GHz and runs Scientific Linux 5.3 (released in 2009, Red Hat-based). My laptop, on the other hand, has a Core 2 Duo T7300 at 2 GHz running Windows 7. Neither machine has hyperthreading.
The main problem is that I have OOP code that runs for around 2 minutes in serial on both systems, but when I add OpenMP to a nested loop it shows the expected reduction in time on my laptop (when 2 threads are used) and a significant increase in time on the server (around 5 minutes with two threads, for example).
There are two classes, "cube" and "space". Space contains a three-dimensional array (20x20x20) of cubes, and the code I am trying to parallelise is a three-way nested loop that calls a member function of cube for each of the cubes. This member function takes three arguments (doubles) and does some calculations based on the private variables of each cube.
inline void space::cubes_refresh(const double vsx, const double vsy, const double vsz) {
    int loopx, loopy, loopz;
    #pragma omp parallel private(loopx, loopy, loopz)
    {
        #pragma omp for schedule(guided,1) nowait
        for (loopx = 0; loopx < cubes_w; loopx++) {
            for (loopy = 0; loopy < cubes_h; loopy++) {
                for (loopz = 0; loopz < cubes_d; loopz++) {
                    // Refreshing the values in source
                    if ((loopx == source_x) && (loopy == source_y) && (loopz == source_z))
                        cube_array[loopx][loopy][loopz].refresh(0.0, 0.0, vsz);
                    // refresh everything else
                    else
                        cube_array[loopx][loopy][loopz].refresh(0.0, 0.0, 0.0);
                }
            }
        } // End of loop
    }
}
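(Since the two machines disagree so sharply, it helps to time the parallel region itself rather than the whole program. A minimal sketch using `omp_get_wtime()`, with a fallback so it also compiles without `-fopenmp`; the fallback measures CPU time, not wall time, and is only a rough stand-in:)

```cpp
#include <ctime>
#ifdef _OPENMP
#include <omp.h>
#endif

// Wall-clock timer: omp_get_wtime() when OpenMP is enabled,
// std::clock() (CPU time) as a crude fallback otherwise.
double wall_time() {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return static_cast<double>(std::clock()) / CLOCKS_PER_SEC;
#endif
}
```

Calling `wall_time()` immediately before and after the `cubes_refresh` call and printing the difference shows whether the slowdown is inside the parallel loop or elsewhere.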
I don't know where the problem could be. As I said before, on my laptop I see the expected improvement in performance, but exactly the same code does significantly worse on the server.
These are the flags I use on my laptop (I have tried using exactly the same flags on both, but nothing changed):
g++ -std=c++98 -fopenmp -O3 -Wl,--enable-auto-import -pedantic main.cpp -o parallel_openmp
And on the server:
g++ -std=c++98 -fopenmp -O3 -W -pedantic main.cpp -o parallel_openmp
I'm running gcc version 4.5.0 and the server is running 4.1.2. I don't know the OpenMP version on the server, as I don't know how to check it; I think it is a version before 3.0, since the `collapse` clause on loops does not work. Could this be the problem?
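(For reference, the supported OpenMP version can be read from the `_OPENMP` macro, which the compiler defines to the spec's date in yyyymm form when `-fopenmp` is given. A minimal check:)

```cpp
// Returns the OpenMP spec date reported by the compiler (yyyymm),
// or 0 if OpenMP support is not enabled (i.e. -fopenmp was not passed).
// Known values: 200505 = OpenMP 2.5, 200805 = OpenMP 3.0, 201107 = OpenMP 3.1.
int openmp_spec_date() {
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;
#endif
}
```

Printing this value on the server would confirm whether the compiler there predates OpenMP 3.0.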
gcc did not support OpenMP until 4.2; OpenMP 3.0 was supported starting in gcc 4.4 (see http://gcc.gnu.org/wiki/openmp). Your operating system vendor may have backported the changes to 4.1.2.
The only thing I can think of that might be causing the problem is that, for some reason, on the server all the threads accessing the cube member array cause a lot of cache misses, but wouldn't this also happen in the program running on my laptop?
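(One workaround worth trying when `collapse` is unavailable, as on a pre-3.0 compiler: collapse the three nested loops by hand into one flat index, which gives the scheduler 8000 iterations to divide among 8 threads instead of 20. A sketch under the assumption of the same 20x20x20 dimensions; the `refresh` call is replaced by a hypothetical stand-in that just records the visit:)

```cpp
#include <vector>

// Hypothetical stand-in dimensions matching the question's 20x20x20 cube array.
const int cubes_w = 20, cubes_h = 20, cubes_d = 20;

// Manually collapsed loop: one flat index i, decoded back into (x, y, z).
// In the real code the body would call cube_array[x][y][z].refresh(...).
void cubes_refresh_collapsed(std::vector<int>& visits) {
    const int total = cubes_w * cubes_h * cubes_d;
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < total; i++) {
        int x = i / (cubes_h * cubes_d);
        int y = (i / cubes_d) % cubes_h;
        int z = i % cubes_d;
        visits[(x * cubes_h + y) * cubes_d + z]++;  // stand-in for refresh()
    }
}
```

Each (x, y, z) triple is produced exactly once, so the iteration order differs from the nested version only in how work is distributed across threads.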