当前位置：文江博客话题详情

CPU 上的数组并行缩减

发布于 2025-01-08 06:46:54 字数 179 浏览 4 评论 0原文

有没有办法在 C/C++ 中并行减少 CPU 上的数组？我最近了解到，使用是不可能的打开mp。还有其他选择吗？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

情话已封尘 2025-01-15 06:46:54

添加：请注意，您可以使用 OpenMP 实现“自定义”缩减，按照此处。

对于 C++：在 Intel 的 TBB 中使用 parallel_reduce （SO 标签：tbb)，您可以减少数组和结构体等复杂类型。尽管与 OpenMP 的缩减条款相比，所需的代码量可能要大得多。

作为示例，让我们并行化矩阵到向量乘法的简单实现：y=Cx。串行代码由两个循环组成：

double x[N], y[M], C[N][M];
// assume x and C are initialized, and y consists of zeros
for(int i=0; i<N; ++i)
  for(int j=0; j<M; ++j) 
    y[j] += C[i][j]*x[i];

通常，为了并行化，交换循环以使外循环迭代独立并并行处理它们：

#pragma omp parallel for
for(int j=0; j<M; ++j) 
  for(int i=0; i<N; ++i)
    y[j] += C[i][j]*x[i];

但这并不总是好主意。如果 M 很小而 N 很大，交换循环将无法提供足够的并行性（例如，考虑计算加权 M 维空间中 N 个点的质心，其中 C 是点数组，x 是权重数组）。因此，减少一个数组（即一个点）会很有帮助。以下是如何使用 TBB 来完成此操作（抱歉，代码未经测试，可能会出现错误）：

struct reduce_body {
  double y_[M]; // accumulating vector
  double (& C_)[N][M]; // reference to a matrix
  double (& x_)[N];    // reference to a vector

  reduce_body( double (&C)[N][M], double (&x)[N] )  : C_(C), x_(x) {
    for (int j=0; j<M; ++j) y_[j] = 0.0; // prepare for accumulation
  }
  // splitting constructor required by TBB
  reduce_body( reduce_body& rb, tbb::split ) : C_(rb.C_), x_(rb.x_) { 
    for (int j=0; j<M; ++j) y_[j] = 0.0;
  }
  // the main computation method
  void operator()(const tbb::blocked_range<int>& r) {
    // closely resembles the original serial loop
    for (int i=r.begin(); i<r.end(); ++i) // iterates over a subrange in [0,N)
      for (int j=0; j<M; ++j)
        y_[j] += C_[i][j]*x_[i];
  }
  // the method to reduce computations accumulated in two bodies
  void join( reduce_body& rb ) {
    for (int j=0; j<M; ++j) y_[j] += rb.y_[j];
  }
};
double x[N], y[M], C[N][M];
...
reduce_body body(C, x);
tbb::parallel_reduce(tbb::blocked_range<int>(0,N), body);
for (int j=0; j<M; ++j)
  y[j] = body.y_[j]; // copy to the destination array

免责声明：我隶属于 TBB。

Added: Note that you can implement "custom" reduction with OpenMP, in the way described here.

For C++: with parallel_reduce in Intel's TBB (SO tag: tbb), you can make reduction on complex types such as arrays and structs. Though the amount of required code can be significantly bigger compared to OpenMP's reduction clause.

As an example, let's parallelize a naive implementation of matrix-to-vector multiplication: y=Cx. Serial code consists of two loops:

double x[N], y[M], C[N][M];
// assume x and C are initialized, and y consists of zeros
for(int i=0; i<N; ++i)
  for(int j=0; j<M; ++j) 
    y[j] += C[i][j]*x[i];

Usually, to parallelize it the loops are exchanged to make the outer loop iterations independent and process them in parallel:

#pragma omp parallel for
for(int j=0; j<M; ++j) 
  for(int i=0; i<N; ++i)
    y[j] += C[i][j]*x[i];

However it's not always good idea. If M is small and N is large, swapping the loop won't give enough parallelism (for example, think of calculating a weighted centroid of N points in M-dimensional space, with C being the array of points and x being the array of weights). So a reduction over an array (i.e. a point) would be helpful. Here is how it can be done with TBB (sorry, the code was not tested, errors are possible):

struct reduce_body {
  double y_[M]; // accumulating vector
  double (& C_)[N][M]; // reference to a matrix
  double (& x_)[N];    // reference to a vector

  reduce_body( double (&C)[N][M], double (&x)[N] )  : C_(C), x_(x) {
    for (int j=0; j<M; ++j) y_[j] = 0.0; // prepare for accumulation
  }
  // splitting constructor required by TBB
  reduce_body( reduce_body& rb, tbb::split ) : C_(rb.C_), x_(rb.x_) { 
    for (int j=0; j<M; ++j) y_[j] = 0.0;
  }
  // the main computation method
  void operator()(const tbb::blocked_range<int>& r) {
    // closely resembles the original serial loop
    for (int i=r.begin(); i<r.end(); ++i) // iterates over a subrange in [0,N)
      for (int j=0; j<M; ++j)
        y_[j] += C_[i][j]*x_[i];
  }
  // the method to reduce computations accumulated in two bodies
  void join( reduce_body& rb ) {
    for (int j=0; j<M; ++j) y_[j] += rb.y_[j];
  }
};
double x[N], y[M], C[N][M];
...
reduce_body body(C, x);
tbb::parallel_reduce(tbb::blocked_range<int>(0,N), body);
for (int j=0; j<M; ++j)
  y[j] = body.y_[j]; // copy to the destination array

Disclaimer: I am affiliated with TBB.

回复收藏 0 原文

~没有更多了~