OpenMP parallel loop is much slower than a regular loop
The whole program has been shrunk to a simple test:
const int loops = 1e10;
int j[4] = { 1, 2, 3, 4 };
time_t time = std::time(nullptr);
for (int i = 0; i < loops; i++) j[i % 4] += 2;
std::cout << std::time(nullptr) - time << std::endl;
int k[4] = { 1, 2, 3, 4 };
omp_set_num_threads(4);
time = std::time(nullptr);
#pragma omp parallel for
for (int i = 0; i < loops; i++) k[omp_get_thread_num()] += 2;
std::cout << std::time(nullptr) - time << std::endl;
In the first case it takes about 3 seconds to run through the loop; in the second case the result is inconsistent and may be 4 to 9 seconds. Both loops run faster with some optimization enabled (like favouring speed and whole program optimization), but the second loop is still significantly slower. I tried adding a barrier at the end of the loop and explicitly specifying the array as shared, but that didn't help. The only case where I managed to make the parallel loop run faster was by making the loops empty. What may be the problem?
Windows 10 x64, CPU Intel Core i5 10300H (4 cores)
Comments (1)
As already pointed out in the various comments, the crux of your problem is false sharing. Indeed, your example is the typical case where one can experiment with this. However, there are also quite a few issues in your code, such as:
[...] loops variable and in all of your j and k tables;
[...] i%4 formula and added a schedule(static, 1) clause. This is not a proper way of doing it, but it was only to get the expected results without using the correct reduction clause.
Then I rewrote your example and also augmented it with what I believe is a better solution to the false sharing issue: using a reduction clause.
Compiling and running without optimizations gives on my laptop:
[...]
Which speaks for itself in regard to the improvement the reduction clause brings.
Now, enabling optimizations from the compiler gives a more mitigated picture:
[...]
If anything, that shows that compilers are quite good at avoiding most false sharing nowadays. Indeed, with your initial (erroneous) k[omp_get_thread_num()], there was no time difference with and without the reduction clause, showing that the compiler was able to avoid the issue.
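The rewritten program and its timings are not reproduced here, so the following is only a minimal sketch of the two approaches being contrasted, not the answerer's code. It assumes a 64-bit counter (1e10 does not fit in a 32-bit int) and a std::chrono timer (std::time only has one-second resolution). The names sum_with_shared_slots, sum_with_reduction, seconds, run_comparison and kLoops are hypothetical, chosen for illustration.

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>

#ifdef _OPENMP
#include <omp.h>
#else
// Fallback (assumption for this sketch) so it still builds serially
// when the compiler is not run with OpenMP enabled.
static int omp_get_thread_num() { return 0; }
#endif

// Hypothetical iteration count from the question, held in 64 bits.
constexpr std::int64_t kLoops = 10000000000LL;

// Variant 1: every thread updates its own slot of one small array.
// The slots are adjacent in memory and will usually share a cache line,
// which is the setup where false sharing can show up.
std::int64_t sum_with_shared_slots(std::int64_t n) {
    std::int64_t k[4] = {1, 2, 3, 4};
    #pragma omp parallel for num_threads(4)
    for (std::int64_t i = 0; i < n; ++i) {
        k[omp_get_thread_num()] += 2;  // thread ids are 0..3 here
    }
    return k[0] + k[1] + k[2] + k[3];
}

// Variant 2: a reduction gives each thread a private accumulator and
// combines the partial sums once, at the end of the loop.
std::int64_t sum_with_reduction(std::int64_t n) {
    std::int64_t total = 1 + 2 + 3 + 4;
    #pragma omp parallel for reduction(+ : total)
    for (std::int64_t i = 0; i < n; ++i) {
        total += 2;
    }
    return total;
}

// Wall-clock duration of a callable, in seconds.
template <typename F>
double seconds(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Runs both variants on n iterations and prints sums and timings.
// Printing the sums means the computed values are actually used.
void run_comparison(std::int64_t n) {
    std::int64_t a = 0, b = 0;
    const double ta = seconds([&] { a = sum_with_shared_slots(n); });
    const double tb = seconds([&] { b = sum_with_reduction(n); });
    std::cout << "shared slots: " << a << " in " << ta << " s\n"
              << "reduction:    " << b << " in " << tb << " s\n";
}
```

Built with OpenMP enabled (for example g++ -O2 -fopenmp, or /openmp with MSVC), calling run_comparison(kLoops) from main prints both sums and both timings. Both functions return 10 + 2*n, so the two variants can be checked against each other before comparing their run times; how large the gap is will depend on the compiler and optimization level, as the answer notes.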