openmp 与非 POD 并行

发布于 2024-11-14 13:31:14 字数 1667 浏览 4 评论 0 原文

我正在尝试加速一个程序，该程序的核心是一个看起来微不足道的循环：

double sum=0.;
#pragma omp parallel for reduction(+:sum) // fails
for( size_t i=0; i<_S.size(); ++i ){
  sum += _S[i].first* R(atoms,_S[i].second) ;
}

虽然循环本身很微不足道，但其中的对象不是 POD：这里 _S 实际上是一个 std::向量< std::pair; > >，并且 R(...) 是某个对象的重载 operator(...) const。它的两个参数都限定为 const，因此调用不会产生任何副作用。

由于大约 90% 的运行时间都花在这个调用上，因此如上所示，添加 OpenMP 编译指示似乎是一件简单的事情，并享受两到三倍的加速；当然，该代码在单个线程中工作正常，但在多个线程中给出明显错误的结果:-)。

没有数据依赖性，_S 和 R(...) 似乎可以安全地在线程之间共享，但仍然会产生无意义的结果。

我真的很感激任何关于如何找出问题所在的指示。

UPD2：

想通了。与所有错误一样，这都是微不足道的。 R(...) 正在调用此类的 operator()：

class objR{
 public:
   objR(const size_t N){
     _buffer.reserve(N);
   };
   double operator(...) const{
     // do something, using the _buffer to store intermediaries
   }

 private:
   std::vector<double> _buffer;
};

显然，不同的线程同时使用 _buffer时间并把它搞砸了。到目前为止，我的解决方案是分配更多空间（内存不是问题，代码受 CPU 限制）：

class objR{
 public:
   objR(const size_t N){
     int nth=1;
     #ifdef _OPENMP
       nth=omp_get_max_threads();
     #endif
     _buffer.resize(N);
   }

   double operator(...) const{
     int thread_id=0;
     #ifdef _OPENMP
       thread_id = omp_get_thread_num();
     #endif
     // do something, using the _buffer[thread_id] to store intermediaries
   }

 private:
   std::vector< std::vector<double> > _buffer;
};

这似乎工作正常。尽管如此，由于这是我第一次涉足多线程，如果有知识渊博的人可以评论是否有更好的方法，我将不胜感激。

原文

I'm trying to speed up a program, in the heart of which is a trivially-looking loop:

double sum=0.;
#pragma omp parallel for reduction(+:sum) // fails
for( size_t i=0; i<_S.size(); ++i ){
  sum += _S[i].first* R(atoms,_S[i].second) ;
}

While the looping itself is trivial, the objects inside it are not PODs: Here _S is in fact an
std::vector< std::pair<double, std::vector<size_t> > >, and R(...) is an overloaded operator(...) const of some object. Both of its arguments qualified as const, so that the call does not have any side effects.

Since some 90% of the runtime is spent in this call, it seemed a simple thing to throw in an OpenMP pragma as shown above, and enjoy a speedup by a factor of two or three;
but of course --- the code works OK with a single thread, but gives plain wrong results for more than one thread :-).

There is no data dependency, both _S and R(...) seem to be safe to share between threads, but still it produces nonsense.

I'd really appreciate any pointers into how to find what goes wrong.

UPD2:

Figured it. As all bugs, it's trivial. The R(...) was calling the operator() of something of this sort:

class objR{
 public:
   objR(const size_t N){
     _buffer.reserve(N);
   };
   double operator(...) const{
     // do something, using the _buffer to store intermediaries
   }

 private:
   std::vector<double> _buffer;
};

Clearly, different threads use the _buffer at the same time and mess it up. My solution so far is to allocate more space (memory is not a problem, the code is CPU-bound):

class objR{
 public:
   objR(const size_t N){
     int nth=1;
     #ifdef _OPENMP
       nth=omp_get_max_threads();
     #endif
     _buffer.resize(N);
   }

   double operator(...) const{
     int thread_id=0;
     #ifdef _OPENMP
       thread_id = omp_get_thread_num();
     #endif
     // do something, using the _buffer[thread_id] to store intermediaries
   }

 private:
   std::vector< std::vector<double> > _buffer;
};

This seems to work correctly. Still, since it's my first foray into things mutithreaded, I'd appreciate if somebody knowledgeable could comment on if there is a better approach.

分享到QQ

分享到微博