Why is this OpenMP program slower than the single-threaded version?
Please look at this code.
Single-threaded program: http://pastebin.com/KAx4RmSJ. Compiled with:
g++ -lrt -O2 main.cpp -o nnlv2
Multi-threaded with OpenMP: http://pastebin.com/fbe4gZSn. Compiled with:
g++ -lrt -fopenmp -O2 main_openmp.cpp -o nnlv2_openmp
I tested it on a dual-core system (so two threads run in parallel). But the multi-threaded version is slower than the single-threaded one (and its timings are unstable; try running it a few times). What's wrong? Where did I make a mistake?
Some tests:
Single-thread:
Layers  Neurons  Inputs  ---  Time (ns)
10      200      200     ---  1898983
10      500      500     ---  11009094
10      1000     1000    ---  48116913
Multi-thread:
Layers  Neurons  Inputs  ---  Time (ns)
10      200      200     ---  2518262
10      500      500     ---  13861504
10      1000     1000    ---  53446849
I don't understand what is wrong.
Comments (4)
Is your goal here to study OpenMP, or to make your program faster? If the latter, it would be more worthwhile to write multiply-add code, reduce the number of passes, and incorporate SIMD.
Step 1: Combine loops and use multiply-add:
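As a rough sketch of what combining the loops could look like, assuming each neuron's output is a weighted sum of its inputs (the pastebin code may be laid out differently). The names `neuron_output`, `weights`, and `inputs` are hypothetical and not taken from the original program:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: weighted sum of one neuron's inputs in a single pass.
// Rather than storing the products in a temporary array and summing them in
// a second loop, each product is added to the accumulator right away.
float neuron_output(const std::vector<float>& weights,
                    const std::vector<float>& inputs) {
    assert(weights.size() == inputs.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        // One multiply-add per element; a compiler may turn this into a
        // fused multiply-add instruction where the target supports it.
        sum += weights[i] * inputs[i];
    }
    return sum;
}
```

The data is read once instead of twice, which is the point of reducing the number of passes. Whether the compiler also vectorizes the loop depends on the flags and the target, so it is worth checking the generated code.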
Compiling with -static and -p, running, and then parsing gmon.out with gprof, I got:
45.65% gomp_barrier_wait_end
That's a lot of time in OpenMP's barrier routine, i.e. time spent waiting for the other threads to finish. Since you run the parallel for loops many times (LAYERS), you lose the advantage of running in parallel: every time a parallel for loop finishes, there is an implicit barrier that won't return until all other threads have finished.
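To make it visible where those barriers sit, here is a minimal sketch, assuming a fully connected network in which every layer has as many neurons as inputs. The function `forward` and the weight layout are hypothetical, not the pastebin code. It opens one parallel region around the layer loop and uses a work-sharing `omp for` inside it, so the thread team is set up once rather than once per layer:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical layout: weights[l][j * n + i] is the weight from input i
// to neuron j in layer l; every layer has n neurons and n inputs.
std::vector<float> forward(const std::vector<std::vector<float>>& weights,
                           std::vector<float> in) {
    const long n = static_cast<long>(in.size());
    for (std::size_t l = 0; l < weights.size(); ++l) {
        assert(weights[l].size() == in.size() * in.size());
    }
    std::vector<float> out(in.size());

    #pragma omp parallel
    {
        // Every thread walks the layer loop; only the neuron loop is split.
        for (std::size_t l = 0; l < weights.size(); ++l) {
            #pragma omp for
            for (long j = 0; j < n; ++j) {
                float sum = 0.0f;
                for (long i = 0; i < n; ++i) {
                    sum += weights[l][j * n + i] * in[i];
                }
                out[j] = sum;
            }
            // Implicit barrier here: the next layer needs all of `out`.
            #pragma omp single
            std::swap(in, out);
            // Implicit barrier after `single` as well.
        }
    }
    return in;
}
```

The barrier between layers cannot simply be dropped, because each layer reads the complete output of the previous one. What this arrangement avoids is entering and leaving a parallel region per layer; whether that helps at these sizes is something to measure. Without -fopenmp the pragmas are ignored and the code runs serially.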
Before anything else, run the multi-threaded build and make sure that procexp or Task Manager shows 100% CPU usage for it. If it doesn't, you are not actually using multiple threads or multiple processor cores.
Also, taken from the wiki:
Environment variables
A method to alter the execution features of OpenMP applications. Used to control loop iterations scheduling, default number of threads, etc. For example OMP_NUM_THREADS is used to specify number of threads for an application.
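A quick way to check from inside the program how many threads OpenMP intends to use is `omp_get_max_threads()`, which reflects `OMP_NUM_THREADS` when that variable is set. The sketch below is only an illustration; the helper names `max_threads` and `report_threads` are made up, and the `_OPENMP` guard lets the same file build without -fopenmp:

```cpp
#include <cassert>
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: upper bound on the number of threads the next
// parallel region would get, or 1 when built without -fopenmp.
int max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Hypothetical helper: print the value so it can be compared with the
// number of cores and with the CPU usage seen while the program runs.
void report_threads() {
    const int n = max_threads();
    assert(n >= 1);
    std::printf("OpenMP threads available: %d\n", n);
}
```

If this reports 1 on a dual-core machine, the program was either built without -fopenmp or the environment limits it to a single thread.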
I don't see where you have actually used OpenMP. Try #pragma omp parallel for above the main loop... (documented here, for example)
The slowness possibly comes from including OpenMP and its initialisation, added code bloat, or other changes to the compilation caused by the compiler flags you introduced to enable it. Alternatively, the loops are so small and simple that the overhead of threading far exceeds the performance gain.
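For the "loops too small" case, OpenMP's `if` clause lets a loop run serially when there is too little work to pay for the threading overhead. A minimal sketch, where the function `scale` and its `threshold` parameter are hypothetical and the right cutoff has to be found by measurement:

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: multiply every element by `factor`, using threads
// only when the vector has more than `threshold` elements.
void scale(std::vector<float>& data, float factor, long threshold) {
    assert(threshold >= 0);
    const long n = static_cast<long>(data.size());
    // With if(false) the region is executed by a single thread.
    #pragma omp parallel for if(n > threshold)
    for (long i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}
```

For example, `scale(v, 2.0f, 10000)` would leave vectors of up to 10000 elements to a single thread and only spread larger ones across the team.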