OpenMP for nested for loops?
I am attempting to use OpenMP in my project, which simulates N agents. Every agent has a position vector. Inside my program, I do something like the following:
for (int i = 0; i < agents.size(); i++) {
    for (int j = 0; j < agents.size(); j++) {
        if (i == j) continue;
        if (distance_between(i, j) < 10) {
            // do a lot of calculations for the interaction between i and j,
            // and make a bunch of changes to variables of the Agents stored in the list
        }
    }
}
Can I simply put "#pragma omp parallel for" in front of the first loop to distribute the work? If I do, my program runs much faster, but I am concerned that the calculations it is doing may not be accurate.
The order of these calculations does not matter, as long as every i,j pair is processed. No variables other than those defined inside the innermost loop are ever changed, and the variables in my Agents class are only ever read, never written.
Not being a systems guy, I have trouble understanding the inner workings of OpenMP and have become slightly suspicious of it. Thanks for any help with this!
3 Answers
Yes, adding the pragma would be correct. The memory is considered implicitly shared by all threads. Still, that does not by itself mean it will run faster; there are multiple issues to be considered, which the other answers go into.
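For illustration, here is a minimal sketch of where the pragma would go, assuming agents is a std::vector<Agent> and distance_between is the function from the question:

// The iterations over i are divided among the threads; j and anything
// declared inside the loop body are automatically private to each thread.
#pragma omp parallel for
for (int i = 0; i < (int)agents.size(); i++) {
    for (int j = 0; j < (int)agents.size(); j++) {
        if (i == j) continue;
        if (distance_between(i, j) < 10) {
            // interaction calculations for the (i, j) pair
        }
    }
}

Compile with -fopenmp (GCC/Clang) or /openmp (MSVC); without the flag the pragma is ignored and the code simply runs serially.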
It depends on what you do in the parallel section. To gain better performance, you may need to change the algorithm of your parallel code rather than just putting #pragma omp parallel in front of your sequential code. Remember that the key concern in parallel programming is shared variables; in some cases, sequential code is actually better than a parallel version. In any case, you have to make sure there are no race conditions in your code to get correct results. The sketch below shows the classic pitfall.
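As an illustration, a hedged sketch with made-up names (n, energy_of): a shared accumulator written by every thread is a data race, and OpenMP's reduction clause is one way to fix it:

// Broken: `total` is shared by all threads, so the concurrent `+=`
// updates race with each other and the result is undefined.
double total = 0.0;
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    total += energy_of(i);   // data race
}

// Fixed: each thread accumulates into a private copy, and OpenMP
// combines the copies when the loop ends.
double total2 = 0.0;
#pragma omp parallel for reduction(+:total2)
for (int i = 0; i < n; i++) {
    total2 += energy_of(i);  // well-defined
}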
Make sure that interactions i,j and j,i, which affect the same data, do not cause data races. You can do so by carefully tuning the distribution of the work, or by using omp critical or omp atomic constructs where necessary, as sketched below.
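For instance, if you exploit the symmetry of the interaction by visiting each pair only once (j starting at i + 1) and updating both agents, writes from different threads can collide. A hedged sketch, where fx and compute_force are made-up stand-ins for your own members and calculations:

#pragma omp parallel for
for (int i = 0; i < (int)agents.size(); i++) {
    for (int j = i + 1; j < (int)agents.size(); j++) {
        if (distance_between(i, j) < 10) {
            double f = compute_force(i, j);  // hypothetical helper
            // Both agents[i] and agents[j] can also be written by threads
            // working on other rows, so each scalar update is guarded.
            #pragma omp atomic
            agents[i].fx += f;
            #pragma omp atomic
            agents[j].fx -= f;
        }
    }
}

If an update touches more than a single scalar, omp critical (which serializes a whole block) is the simpler, if slower, tool.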
Apart from that, there is no problem at all in making your code run faster by just using an omp parallel for construct on the outer loop. Bear in mind that if your processor has a large number of processing units (cores, hardware threads, etc.), or if you are using a ccNUMA machine, it may be worth doing some additional work, because otherwise the scalability of your code will not be as good as it could be.
The easiest improvement is to add a collapse(2) clause to your omp for directive. This tells OpenMP that you want to distribute the iterations of both loops together.
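A minimal sketch of that variant; note that collapse requires the loops to be perfectly nested, which they already are here:

#pragma omp parallel for collapse(2)
for (int i = 0; i < (int)agents.size(); i++) {
    for (int j = 0; j < (int)agents.size(); j++) {
        if (i == j) continue;
        if (distance_between(i, j) < 10) {
            // interaction calculations as before
        }
    }
}

OpenMP then distributes all size()*size() iterations across the threads instead of only the size() outer iterations, which improves load balance when some rows are much cheaper than others.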
Once you reach huge numbers of particles (or agents, as you call them), it may be wise to sort them all according to their position. This will give the if condition inside the innermost loop much more stable behavior, thus allowing better branch prediction (see Why is it faster to process a sorted array?). In addition, you can avoid most of the branch work entirely by dividing the geometric space into bins, because you know up front that non-adjacent buckets are too far apart for the condition to ever succeed; a sketch of this idea follows below.
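To give a rough idea of the binning approach, here is a hedged, deliberately simplified sketch for agents with a 1-D position (real code would use 2-D or 3-D cells; Agent::x, build_bins, and interact_binned are all made-up names):

#include <cmath>
#include <vector>

const double CUTOFF = 10.0;  // the interaction radius from the question

// Bucket agent indices into cells of width CUTOFF. Any two agents closer
// than CUTOFF are then guaranteed to be in the same or an adjacent cell.
std::vector<std::vector<int>> build_bins(const std::vector<Agent>& agents,
                                         double min_x, double max_x) {
    int nbins = (int)std::ceil((max_x - min_x) / CUTOFF) + 1;
    std::vector<std::vector<int>> bins(nbins);
    for (int i = 0; i < (int)agents.size(); i++)
        bins[(int)((agents[i].x - min_x) / CUTOFF)].push_back(i);
    return bins;
}

void interact_binned(std::vector<Agent>& agents,
                     const std::vector<std::vector<int>>& bins) {
    #pragma omp parallel for
    for (int b = 0; b < (int)bins.size(); b++) {
        for (int i : bins[b]) {
            // Only the same cell and the two adjacent cells can hold
            // neighbors; every other pair is skipped without any test.
            for (int nb = b - 1; nb <= b + 1; nb++) {
                if (nb < 0 || nb >= (int)bins.size()) continue;
                for (int j : bins[nb]) {
                    if (i == j) continue;
                    if (distance_between(i, j) < CUTOFF) {
                        // interaction for (i, j), as in the original loop
                    }
                }
            }
        }
    }
}

The distance test stays for pairs in adjacent cells, but the vast majority of far-apart pairs are never even visited.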
This kind of problem is very well known and, as you might already know, such problems are called n-body problems. The bucket-optimized version is known as the Barnes-Hut simulation, although you may find other approaches that work even faster (harder to parallelize, but still more efficient most of the time), such as the Fast Multipole Method, which more or less reduces the number of computations by solving both interactions i,j and j,i in the same step.