Why does my code run slower with multiple threads than with a single thread when compiled for profiling (-pg)?
I'm writing a ray tracer.
Recently, I added threading to the program to exploit the additional cores on my i5 Quad Core.
In a weird turn of events, the debug version of the application now runs slower, but the optimized build runs faster than before I added threading.
I'm passing the "-g -pg" flags to gcc for the debug build and the "-O3" flag for the optimized build.
Host system: Ubuntu Linux 10.4 AMD64.
I know that debug symbols add significant overhead to a program, but the relative performance has always been maintained, i.e. a faster algorithm will always run faster in both the debug and the optimized build.
Any idea why I'm seeing this behavior?
Debug version is compiled with "-g3 -pg". Optimized version with "-O3".
Optimized no threading: 0m4.864s
Optimized threading: 0m2.075s
Debug no threading: 0m30.351s
Debug threading: 0m39.860s
Debug threading after "strip": 0m39.767s
Debug no threading (no-pg): 0m10.428s
Debug threading (no-pg): 0m4.045s
This convinces me that "-g3" is not to blame for the odd performance delta; rather, it's the "-pg" switch. It's likely that the "-pg" option adds some sort of locking mechanism to measure thread performance.
Since "-pg" is broken on threaded applications anyway, I'll just remove it.
What do you get without the `-pg` flag? That's not debugging symbols (which don't affect code generation); that's for profiling (which does). It's quite plausible that profiling in a multithreaded process requires additional locking, which slows the multithreaded version down, even to the point of making it slower than the non-multithreaded version.
You are talking about two different things here: debug symbols and compiler optimization. If you use the strongest optimization settings the compiler has to offer, you do so at the cost of losing symbols that are useful in debugging.
Your application is not running slower due to debugging symbols; it's running slower because of less optimization done by the compiler.
Debugging symbols are not 'overhead' beyond the fact that they occupy more disk space. Code compiled at maximum optimization (-O3) should not be adding debug symbols. That's a flag you would set when you have no need for said symbols.
If you need debugging symbols, you gain them at the expense of compiler optimization. However, once again, this is not 'overhead'; it's just the absence of compiler optimization.
Is the profile code inserting instrumentation calls in enough functions to hurt you?
If you single-step at the assembly language level, you'll find out pretty quick.
Multithreaded code execution time is not always measured by gprof as you might expect.
You should time your code with another timer in addition to gprof to see the difference.
My example: running the LULESH CORAL benchmark on a two-NUMA-node Intel Sandy Bridge machine (8 cores + 8 cores) with size -s 50 and 20 iterations (-i 20), compiled with gcc 6.3.0 and -O3, I get:
With 1 thread running: ~3.7 s without -pg and ~3.8 s with it, but according to the gprof analysis the code ran for only 3.5 s.
With 16 threads running: ~0.6 s without -pg and ~0.8 s with it, but according to the gprof analysis the code ran for ~4.5 s...
The times above were measured with gettimeofday, outside the parallel region (at the start and end of the main function).
Therefore, if you measure your application's time the same way, you may well see the same speedup with and without -pg. It is just the gprof measurement that is wrong for parallel code, at least in the LULESH OpenMP version.