OpenMP + SSE gives no speedup
My professor came across this interesting experiment on 3D linearly separable kernel convolution using SSE and OpenMP, and asked me to benchmark it on our system. The author claims a crazy 18-fold speedup over the serial approach! That might not always hold, but we were expecting at least a 2-4x speedup running it on a dual-core Intel machine.
Alas, we found no speedup at all. The serial code always performs better, with or without OpenMP.
I am using Linux and observed a certain trend: when no other processes are running on the system, after a while the loadavg starts increasing and the %CPU utilization drops.
Another probable false positive, which I ran into accidentally: I started the program, then immediately paused it. Then I resumed it in the background with bg and saw a speedup of more than 2. This happens every time!
Any advice would be great.
Thanks,
Sayan
Comments (2)
You really need to profile your program to identify the bottlenecks. You also need to look at optimisation in a more "holistic" way. Your performance issues may be related to poor design, poor coding, memory bandwidth limitations, and a host of other problems, none of which will be addressed by micro-optimisations such as using SIMD instead of scalar code.
Start with a profile (use a tool like Zoom for this) and work from there.
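Before profiling, it may be worth confirming that the benchmark measures wall-clock time. The post does not say how the runs were timed, so this is only a sketch under that assumption: clock() reports CPU time summed over all threads, which can hide an OpenMP speedup, whereas timespec_get (C11) reads the wall clock. The workload sum_of_squares and the helper names are hypothetical stand-ins, not the convolution code from the experiment.

```c
#include <stddef.h>
#include <time.h>

/* Wall-clock seconds, read through the C11 timespec_get call. */
double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in workload: sum of x[i]^2.
   The pragma is ignored if the compiler is not run with OpenMP enabled. */
double sum_of_squares(const float *x, size_t n)
{
    double acc = 0.0;
    #pragma omp parallel for reduction(+:acc)
    for (long i = 0; i < (long)n; ++i)
        acc += (double)x[i] * (double)x[i];
    return acc;
}

/* Runs the workload once, stores its result, returns elapsed wall seconds. */
double time_sum_of_squares(const float *x, size_t n, double *result)
{
    double t0 = wall_seconds();
    *result = sum_of_squares(x, n);
    return wall_seconds() - t0;
}
```

Calling time_sum_of_squares on the same input with and without OpenMP enabled (for GCC, the -fopenmp flag) gives two wall-clock figures that can be compared directly.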
Well, I poked around a bit and then tried the following: I compiled the program with the -O0 option (no optimization) and got a speedup of about 2 for almost all the XYZ values. I could also see that 2 threads were being used on my dual core (previously it was using only one).
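One way to check how many threads a parallel region actually gets is to ask the OpenMP runtime. This is a minimal sketch with a hypothetical helper name, count_threads. It assumes GCC-style compilation, where the pragmas are silently ignored unless -fopenmp is passed, in which case the function reports 1.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the number of threads in a parallel region,
   or 1 if the program was compiled without OpenMP support. */
int count_threads(void)
{
    int n = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* Only one thread writes the shared result. */
        #pragma omp single
        n = omp_get_num_threads();
    }
#endif
    return n;
}
```

On a dual-core machine a result of 1 would suggest either that OpenMP was not enabled at compile time or that the runtime was limited to one thread, for example through the OMP_NUM_THREADS environment variable.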
But now, when I remove the OpenMP pragmas, I see no speedup at all. This bothers me, because SSE should be able to speed things up considerably. So the speedup can be attributed entirely to OpenMP, and I have to find out why SSE is failing. Somebody told me that if the operations are trivial (how much weight that word carries is debatable, since it differs from person to person), using SSE garners no speedup. But I wrote a small program that calculates sqrt(i)/i for i_max_size = 64000, and the SSE version gave a speedup of 3.5 to 4.0.
I will post more once I find the root cause.