I've tried valgrind, _GLIBCXX_DEBUG, -fno-strict-aliasing; how do I debug this error?
I have a really strange error that I've spent several days trying to figure out, so now I want to see if anybody has any comments to help me understand what's happening.
Some background. I'm working on a software project which involves adding C++ extensions to Python 2.7.1 using Boost 1.45, so all my code is being run through the Python interpreter. Recently, I made a change to the code which broke one of our regression tests. This regression test is probably too sensitive to numerical fluctuations (e.g. different machines), so I should fix that. However, since this regression is breaking on the same machine/compiler that produced the original regression results, I traced the difference in results to this snippet of numerical code (which is verifiably unrelated to the code I changed):
c[3] = 0.25 * (-3 * df[i-1] - 23 * df[i] - 13 * df[i+1] - df[i+2]
- 12 * f[i-1] - 12 * f[i] + 20 * f[i+1] + 4 * f[i+2]);
printf("%2li %23a : %23a %23a %23a %23a : %23a %23a %23a %23a\n",i,
c[3],
df[i-1],df[i],df[i+1],df[i+2],f[i-1],f[i],f[i+1],f[i+2]);
which constructs some numerical tables. Note that:
- %a prints an exact ASCII (hexadecimal) representation of each floating-point value
- The left-hand side (lhs) is c[3], and the right-hand side (rhs) is the other 8 values.
- The output below is for values of i that are far from the boundaries of f and df.
- This code exists within a loop over i, which is itself nested several layers deep (so I'm unable to provide an isolated case that reproduces this).
So I cloned my source tree, and the only difference between the two executables I compile is that the clone includes some extra code which isn't even executed in this test. This makes me suspect that it must be a memory problem, since the only difference should be where the code exists in memory... Anyway, when I run the two executables, here's the difference in what they produce:
diff new.out old.out
655,656c655,656
< 6 -0x1.7c2a5a75fc046p-10 : 0x0p+0 0x0p+0 0x0p+0 -0x1.75eee7aa9b8ddp-7 : 0x1.304ec13281eccp-4 0x1.304ec13281eccp-4 0x1.304ec13281eccp-4 0x1.1eaea08b55205p-4
< 7 -0x1.a18f0b3a3eb8p-10 : 0x0p+0 0x0p+0 -0x1.75eee7aa9b8ddp-7 -0x1.a4acc49fef001p-6 : 0x1.304ec13281eccp-4 0x1.304ec13281eccp-4 0x1.1eaea08b55205p-4 0x1.9f6a9bc4559cdp-5
---
> 6 -0x1.7c2a5a75fc006p-10 : 0x0p+0 0x0p+0 0x0p+0 -0x1.75eee7aa9b8ddp-7 : 0x1.304ec13281eccp-4 0x1.304ec13281eccp-4 0x1.304ec13281eccp-4 0x1.1eaea08b55205p-4
> 7 -0x1.a18f0b3a3ec5cp-10 : 0x0p+0 0x0p+0 -0x1.75eee7aa9b8ddp-7 -0x1.a4acc49fef001p-6 : 0x1.304ec13281eccp-4 0x1.304ec13281eccp-4 0x1.1eaea08b55205p-4 0x1.9f6a9bc4559cdp-5
<more output truncated>
You can see that the value in c[3] is subtly different, while none of the rhs values are different. So somehow identical input is giving rise to different output. I tried simplifying the rhs expression, but any change I make eliminates the difference. If I print &c[3], the difference goes away. If I run on two different machines (Linux, OS X) I have access to, there's no difference. Here's what I've already tried:
- valgrind (reported numerous problems in python, but nothing in my code, and nothing that looked serious)
- -D_GLIBCXX_DEBUG -D_GLIBCXX_DEBUG_ASSERT -D_GLIBCXX_DEBUG_PEDASSERT -D_GLIBCXX_DEBUG_VERIFY (but nothing asserts)
- -fno-strict-aliasing (but I do get aliasing compile warnings out of the boost code)
I tried switching from gcc 4.1.2 to gcc 4.5.2 on the machine that has the problem, and this specific, isolated difference goes away (but the regression still fails, so let's assume that's a different problem).
Is there anything I can do to isolate the problem further? For future reference, is there any way to analyze or understand this kind of problem more quickly? For example, given my description of the lhs changing even though the rhs does not, what would you conclude?
EDIT: The problem was entirely due to -ffast-math.
You can change the floating-point type used in your program. If you use float, you can switch to double; if c, f, and df are double, you can switch to long double (80-bit on Intel; 128-bit on SPARC). With gcc 4.5.2 you can even try the __float128 (128-bit) software-emulated type. The rounding error gets smaller as the floating-point type gets longer.
Why does adding some code (even unexecuted code) change the result? gcc may compile the program differently when the code size changes. There are a lot of heuristics inside GCC, and some of them are based on function sizes, so gcc may compile your function in a different way.
Also, try to compile your project with the flags -mfpmath=sse -msse2, because using x87 (the default fpmath for older gcc) is a known source of excess-precision surprises: http://gcc.gnu.org/wiki/x87note

PS: you should not use -ffast-math-like options when you are interested in stable numeric results: http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Optimize-Options.html
Part of fast-math may change results; part of it hides traps and NaN-related errors from the user (and sometimes the user wants exactly those traps in order to debug the code); and part of it lets the compiler assume the default rounding mode everywhere (which can be wrong for some programs).