Slow OpenMP reduction

Posted on 2024-11-14 15:21:01

I wrote simple C++ code that computes the sum of an array as a reduction, but with the OpenMP reduction the program runs slowly. There are two variants of the program: one is the simplest sum, the other is the sum of a more complex math function. In the code, the complex variant is commented out.

#include <iostream>
#include <omp.h>
#include <math.h>

using namespace std;

#define N 100000000
#define NUM_THREADS 4

int main() {

  int *arr = new int[N];

  for (int i = 0; i < N; i++) {
    arr[i] = i;
  }

  omp_set_num_threads(NUM_THREADS);
  cout << NUM_THREADS << endl;

  clock_t start = clock();
  int sum = 0;
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < N; i++) {
    // sum += sqrt(sqrt(arr[i] * arr[i])); // complex variant
    sum += arr[i]; // simple variant
  }

  double diff = ( clock() - start ) / (double)CLOCKS_PER_SEC;
  cout << "Time " << diff << "s" << endl;

  cout << sum << endl;

  delete[] arr;

  return 0;
}

I compile it with ICPC and GCC:

icpc reduction.cpp -openmp -o reduction -O3
g++ reduction.cpp -fopenmp -o reduction -O3

Processor: Intel Core 2 Duo T5850, OS: Ubuntu 10.10

Here are the execution times of the simple and complex variants, compiled with and without OpenMP.

Simple variant "sum += arr[i];":

icpc
0.1s without OpenMP
0.18s with OpenMP

g++
0.11s without OpenMP
0.17s with OpenMP

Complex variant "sum += sqrt(sqrt(arr[i] * arr[i]));":

icpc
2.92s without OpenMP
3.37s with OpenMP

g++
47.97s without OpenMP
48.2s with OpenMP

In the system monitor I see that 2 cores are busy in the program with OpenMP and 1 core in the program without OpenMP. I tried several thread counts in OpenMP and got no speedup. I don't understand why the reduction is slow.

Comments (2)

守不住的情 2024-11-21 15:21:01

The function clock() measures the processor time consumed by the whole process, so the printed time is the sum of the time consumed by all threads. If you want to see wall time (the real time elapsed from beginning to end), use e.g. the times() function on a POSIX system.
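To make the difference concrete, here is a minimal sketch that records both processor time and wall time around the same reduction. It is not the asker's program: the `Timing` struct and the `timed_sum` function are hypothetical names introduced for illustration, `std::chrono::steady_clock` is swapped in as a portable wall clock in place of `times()`, and the accumulator is widened to 64 bits so the sum cannot overflow. The statement that `std::clock()` adds up CPU time across threads holds on POSIX systems.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ctime>
#include <vector>

// Hypothetical helper type: both clocks measured around the same loop.
struct Timing {
    double cpu_seconds;   // processor time of the whole process (std::clock)
    double wall_seconds;  // elapsed real time (std::chrono::steady_clock)
    std::int64_t sum;     // result of the reduction
};

// Hypothetical helper: sums the array and reports both timings.
Timing timed_sum(const std::vector<int>& arr) {
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    std::int64_t sum = 0;
    const std::int64_t n = static_cast<std::int64_t>(arr.size());
    // Ignored when compiled without -fopenmp; the loop then runs serially.
    #pragma omp parallel for reduction(+:sum)
    for (std::int64_t i = 0; i < n; i++) {
        sum += arr[i];
    }

    const auto wall_end = std::chrono::steady_clock::now();
    const std::clock_t cpu_end = std::clock();

    const double cpu =
        static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    const double wall =
        std::chrono::duration<double>(wall_end - wall_start).count();
    return {cpu, wall, sum};
}
```

Calling `timed_sum` on the array from the question and printing both fields should show whether the OpenMP build is really slower in elapsed time or only in accumulated CPU time. With OpenMP enabled, `omp_get_wtime()` is another way to read a wall clock.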

春花秋月 2024-11-21 15:21:01

What you're doing is so simple that you're probably being limited by memory bandwidth. I rarely get any speedup until the work takes much longer than moving the data to and from it. Plus, a reduction has extra work in merging all the sub-results.
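For scale, the simple variant reads 100000000 ints, about 400 MB assuming a 4-byte int, in roughly 0.1 s, which is on the order of 4 GB/s of memory traffic for one addition per element. The merging step mentioned above can be sketched by writing the reduction out by hand. This is only a conceptual illustration of what `reduction(+:sum)` asks for, not how any particular compiler implements it, and `manual_reduction` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: a hand-written equivalent of reduction(+:sum).
std::int64_t manual_reduction(const std::vector<int>& arr) {
    std::int64_t total = 0;
    const std::int64_t n = static_cast<std::int64_t>(arr.size());

    // Without -fopenmp the pragmas are ignored and the block runs once,
    // which still produces the correct sum.
    #pragma omp parallel
    {
        // Each thread accumulates into its own private partial sum.
        std::int64_t local = 0;

        #pragma omp for nowait
        for (std::int64_t i = 0; i < n; i++) {
            local += arr[i];
        }

        // Merge step: one thread at a time adds its partial sum.
        #pragma omp critical
        total += local;
    }
    return total;
}
```

The merge runs once per thread, so its cost is small next to 10^8 additions. In this sketch the dominant cost is still streaming the array from memory, which extra threads cannot speed up once the memory bus is saturated.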
