OpenMP for nested for loops?
I am attempting to use OpenMP in my project, which simulates N agents. Every agent has a position vector. Inside my program, I do something like the following:
for (int i = 0; i < agents.size(); i++) {
    for (int j = 0; j < agents.size(); j++) {
        if (i == j) continue;
        if (distance_between(i, j) < 10) {
            // do a lot of calculations for the interaction between i and j,
            // and make a bunch of changes to variables of the Agents stored in the list
        }
    }
}
Can I simply put "#pragma omp parallel for" in front of the first loop to distribute the work? If I do, my program runs much faster, but I am concerned that the calculations it is doing may not be accurate.
The order of these calculations does not matter, as long as every i,j pair is processed. No variables other than those defined inside the innermost loop are ever changed, and the variables in my Agents class are only ever read, never written.
Not being a systems guy, I have trouble understanding the inner workings of OpenMP and have become slightly suspicious of it. Thanks for any help with this!
3 Answers
Yes, adding the pragma would be correct. The memory is considered implicitly shared by all threads. Still, that does not by itself mean it will run faster; there are multiple issues to be considered, which the other answers go into.
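For illustration, here is a minimal sketch of where the pragma would go, assuming agents is a std::vector<Agent> and distance_between is the function from the question:

// The iterations over i are divided among the threads; j and anything
// declared inside the loop body are automatically private to each thread.
#pragma omp parallel for
for (int i = 0; i < (int)agents.size(); i++) {
    for (int j = 0; j < (int)agents.size(); j++) {
        if (i == j) continue;
        if (distance_between(i, j) < 10) {
            // interaction calculations for the (i, j) pair
        }
    }
}

Compile with -fopenmp (GCC/Clang) or /openmp (MSVC); without the flag the pragma is ignored and the code simply runs serially.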
It depends on what you do in the parallel section. To gain better performance, you may need to change the algorithm of your parallel code rather than just putting #pragma omp parallel in front of your sequential code. Remember that the key concern in parallel programming is shared variables; in some cases, sequential code is actually better than a parallel version. In any case, you have to make sure there are no race conditions in your code to get correct results. The sketch below shows the classic pitfall.
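As an illustration, a hedged sketch with made-up names (n, energy_of): a shared accumulator written by every thread is a data race, and OpenMP's reduction clause is one way to fix it:

// Broken: `total` is shared by all threads, so the concurrent `+=`
// updates race with each other and the result is undefined.
double total = 0.0;
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    total += energy_of(i);   // data race
}

// Fixed: each thread accumulates into a private copy, and OpenMP
// combines the copies when the loop ends.
double total2 = 0.0;
#pragma omp parallel for reduction(+:total2)
for (int i = 0; i < n; i++) {
    total2 += energy_of(i);  // well-defined
}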
Make sure that interactions i,j and j,i, which affect the same data, do not cause data races. You can do so by carefully tuning the distribution of the work, or by using omp critical or omp atomic constructs where necessary, as sketched below.
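For instance, if you exploit the symmetry of the interaction by visiting each pair only once (j starting at i + 1) and updating both agents, writes from different threads can collide. A hedged sketch, where fx and compute_force are made-up stand-ins for your own members and calculations:

#pragma omp parallel for
for (int i = 0; i < (int)agents.size(); i++) {
    for (int j = i + 1; j < (int)agents.size(); j++) {
        if (distance_between(i, j) < 10) {
            double f = compute_force(i, j);  // hypothetical helper
            // Both agents[i] and agents[j] can also be written by threads
            // working on other rows, so each scalar update is guarded.
            #pragma omp atomic
            agents[i].fx += f;
            #pragma omp atomic
            agents[j].fx -= f;
        }
    }
}

If an update touches more than a single scalar, omp critical (which serializes a whole block) is the simpler, if slower, tool.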
Apart from that, there is no problem at all in making your code run faster by just using an omp parallel for construct on the outer loop. Bear in mind that if your processor has a large number of processing units (cores, hardware threads, etc.), or if you are using a ccNUMA machine, it may be worth doing some additional work, because otherwise the scalability of your code will not be as good as it could be.
The easiest improvement is to add a collapse(2) clause to your omp for directive. This tells OpenMP that you want to distribute the iterations of both loops together.
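A minimal sketch of that variant; note that collapse requires the loops to be perfectly nested, which they already are here:

#pragma omp parallel for collapse(2)
for (int i = 0; i < (int)agents.size(); i++) {
    for (int j = 0; j < (int)agents.size(); j++) {
        if (i == j) continue;
        if (distance_between(i, j) < 10) {
            // interaction calculations as before
        }
    }
}

OpenMP then distributes all size()*size() iterations across the threads instead of only the size() outer iterations, which improves load balance when some rows are much cheaper than others.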
Once you reach huge numbers of particles (or agents, as you call them), it may be wise to sort them all according to their position. This will give the if condition inside the innermost loop much more stable behavior, thus allowing better branch prediction (see Why is it faster to process a sorted array?). In addition, you can avoid most of the branch work entirely by dividing the geometric space into bins, because you know up front that non-adjacent buckets are too far apart for the condition to ever succeed; a sketch of this idea follows below.
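To give a rough idea of the binning approach, here is a hedged, deliberately simplified sketch for agents with a 1-D position (real code would use 2-D or 3-D cells; Agent::x, build_bins, and interact_binned are all made-up names):

#include <cmath>
#include <vector>

const double CUTOFF = 10.0;  // the interaction radius from the question

// Bucket agent indices into cells of width CUTOFF. Any two agents closer
// than CUTOFF are then guaranteed to be in the same or an adjacent cell.
std::vector<std::vector<int>> build_bins(const std::vector<Agent>& agents,
                                         double min_x, double max_x) {
    int nbins = (int)std::ceil((max_x - min_x) / CUTOFF) + 1;
    std::vector<std::vector<int>> bins(nbins);
    for (int i = 0; i < (int)agents.size(); i++)
        bins[(int)((agents[i].x - min_x) / CUTOFF)].push_back(i);
    return bins;
}

void interact_binned(std::vector<Agent>& agents,
                     const std::vector<std::vector<int>>& bins) {
    #pragma omp parallel for
    for (int b = 0; b < (int)bins.size(); b++) {
        for (int i : bins[b]) {
            // Only the same cell and the two adjacent cells can hold
            // neighbors; every other pair is skipped without any test.
            for (int nb = b - 1; nb <= b + 1; nb++) {
                if (nb < 0 || nb >= (int)bins.size()) continue;
                for (int j : bins[nb]) {
                    if (i == j) continue;
                    if (distance_between(i, j) < CUTOFF) {
                        // interaction for (i, j), as in the original loop
                    }
                }
            }
        }
    }
}

The distance test stays for pairs in adjacent cells, but the vast majority of far-apart pairs are never even visited.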
This kind of problem is very well known and, as you might already know, such problems are called n-body problems. The bucket-optimized version is known as the Barnes-Hut simulation, although you may find other approaches that work even faster (harder to parallelize, but still more efficient most of the time), such as the Fast Multipole Method, which more or less reduces the number of computations by solving both interactions i,j and j,i in the same step.