OpenMP: parallelizing a loop across N threads
I am trying to parallelize a loop of 50 million iterations across several threads - first with 1 thread, then with 4, 8 and 16. Below is the code implementing this.
#include <iostream>
#include <omp.h>
using namespace std;

void someFoo();

int main() {
    someFoo();
}

void someFoo() {
    long sum = 0;
    int numOfThreads[] = {1, 4, 8, 16};
    for(int j = 0; j < sizeof(numOfThreads) / sizeof(int); j++) {
        omp_set_num_threads(numOfThreads[j]);
        start = omp_get_wtime();
        #pragma omp parallel for
        for(int i = 0; i < 50000000; i++) {
            sum += i * 10;
        }
        #pragma omp end parallel
        end = omp_get_wtime();
        cout << "Result: " << sum << ". Spent time: " << (end - start) << "\n";
    }
}
It is expected that the program will run faster with 4 threads than with 1, faster with 8 threads than with 4, and faster with 16 threads than with 8, but in practice this is not the case - the timings are chaotic and there is almost no difference between the runs. Also, Task Manager does not show the program running in parallel. My computer has 4 cores and 8 logical processors.
Please tell me where I made a mistake and how to properly parallelize the loop across N threads.
2 Answers
There is a race condition in your code because sum is read/written from multiple threads at the same time. This should cause wrong results. You can fix this using a reduction with the directive #pragma omp parallel for reduction(+:sum). Note that OpenMP does not check whether your loop can be parallelized; that is your responsibility.

Additionally, the parallel computation might be slower than the sequential one, since a clever compiler can see that sum = 50000000*(50000000-1)/2*10 = 12499999750000000 (AFAIK, Clang does that). As a result, the benchmark is certainly flawed. Note that this value is bigger than what the type long can hold on platforms where long is 32 bits (such as Windows), so there is certainly an overflow in your code.

Moreover, AFAIK, there is no such directive as #pragma omp end parallel.

Finally, note that you can control the number of threads using the OMP_NUM_THREADS environment variable, which is generally more convenient than setting it in your application (hardwiring a given number of threads in the application code is generally not a good idea, even for benchmarks).
First you need to fix some of the compiler issues in your code example, like removing the invalid #pragma omp end parallel, declaring the variables correctly, and so on. Second you need to fix the race condition during the update of the variable sum: that variable is shared among the threads and updated concurrently. The easiest way is to use the reduction clause of OpenMP; your code would look like the sketch below. With that you should get some speedup when running on multiple cores.
NOTE: To solve the overflow first mentioned by @Jérôme Richard, I changed the 'sum' variable from long to a double.
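
A sketch of what the corrected program could look like, assuming only the fixes this answer describes (reduction clause, declared timing variables, sum changed to a double):

#include <iostream>
#include <omp.h>
using namespace std;

void someFoo() {
    int numOfThreads[] = {1, 4, 8, 16};
    for (int j = 0; j < (int)(sizeof(numOfThreads) / sizeof(int)); j++) {
        omp_set_num_threads(numOfThreads[j]);
        double sum = 0.0;                // double per the NOTE above; also reset for each run
        double start = omp_get_wtime();  // timing variables are now declared
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 50000000; i++) {
            sum += i * 10;
        }
        // no "#pragma omp end parallel": the parallel region ends with the loop
        double end = omp_get_wtime();
        cout << "Threads: " << numOfThreads[j] << ". Result: " << sum
             << ". Spent time: " << (end - start) << "\n";
    }
}

int main() {
    someFoo();
}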