I've tried valgrind, _GLIBCXX_DEBUG, and -fno-strict-aliasing; how can I debug this error?



I have a really strange error that I've spent several days trying to figure out, and so now I want to see if anybody has any comments to help me understand what's happening.

Some background. I'm working on a software project which involves adding C++ extensions to Python 2.7.1 using Boost 1.45, so all my code is being run through the Python interpreter. Recently, I made a change to the code which broke one of our regression tests. This regression test is probably too sensitive to numerical fluctuations (e.g. different machines), so I should fix that. However, since this regression is breaking on the same machine/compiler that produced the original regression results, I traced the difference in results to this snippet of numerical code (which is verifiably unrelated to the code I changed):

/* the lhs under scrutiny: a single expression over eight rhs inputs */
c[3] = 0.25 * (-3 * df[i-1] - 23 * df[i] - 13 * df[i+1] - df[i+2]
               - 12 * f[i-1] - 12 * f[i] + 20 * f[i+1] + 4 * f[i+2]);
/* %a prints each double exactly, as hexadecimal floating point */
printf("%2li %23a : %23a %23a %23a %23a : %23a %23a %23a %23a\n", i,
       c[3],
       df[i-1], df[i], df[i+1], df[i+2], f[i-1], f[i], f[i+1], f[i+2]);

which constructs some numerical tables. Note that:

  • %a provides an exact ASCII (hexadecimal floating-point) representation
  • The left-hand side (lhs) is c[3]; the right-hand side (rhs) is the other 8 values.
  • The output below is for values of i far from the boundaries of f and df.
  • This code sits inside a loop over i, which is itself nested several layers deep (so I'm unable to provide an isolated case that reproduces this).

So I cloned my source tree, and the only difference between the two executables I compile is that the clone includes some extra code which isn't even executed in this test. This makes me suspect that it must be a memory problem, since the only difference should be where the code exists in memory... Anyway, when I run the two executables, here's the difference in what they produce:

diff new.out old.out 
655,656c655,656
<  6  -0x1.7c2a5a75fc046p-10 :                  0x0p+0                  0x0p+0                  0x0p+0   -0x1.75eee7aa9b8ddp-7 :    0x1.304ec13281eccp-4    0x1.304ec13281eccp-4    0x1.304ec13281eccp-4    0x1.1eaea08b55205p-4
<  7   -0x1.a18f0b3a3eb8p-10 :                  0x0p+0                  0x0p+0   -0x1.75eee7aa9b8ddp-7   -0x1.a4acc49fef001p-6 :    0x1.304ec13281eccp-4    0x1.304ec13281eccp-4    0x1.1eaea08b55205p-4    0x1.9f6a9bc4559cdp-5
---
>  6  -0x1.7c2a5a75fc006p-10 :                  0x0p+0                  0x0p+0                  0x0p+0   -0x1.75eee7aa9b8ddp-7 :    0x1.304ec13281eccp-4    0x1.304ec13281eccp-4    0x1.304ec13281eccp-4    0x1.1eaea08b55205p-4
>  7  -0x1.a18f0b3a3ec5cp-10 :                  0x0p+0                  0x0p+0   -0x1.75eee7aa9b8ddp-7   -0x1.a4acc49fef001p-6 :    0x1.304ec13281eccp-4    0x1.304ec13281eccp-4    0x1.1eaea08b55205p-4    0x1.9f6a9bc4559cdp-5
<more output truncated>

You can see that the value in c[3] is subtly different, while none of the rhs values differ. So somehow identical input is giving rise to different output. I tried simplifying the rhs expression, but any change I make eliminates the difference. If I print &c[3], the difference goes away. If I run on the two different machines (linux, osx) I have access to, there's no difference. Here's what I've already tried:

  • valgrind (reported numerous problems in python, but nothing in my code, and nothing that looked serious)
  • -D_GLIBCXX_DEBUG -D_GLIBCXX_DEBUG_ASSERT -D_GLIBCXX_DEBUG_PEDASSERT -D_GLIBCXX_DEBUG_VERIFY (but nothing asserts)
  • -fno-strict-aliasing (but I do get aliasing compile warnings out of the boost code)

I tried switching from gcc 4.1.2 to gcc 4.5.2 on the machine that has the problem, and this specific, isolated difference goes away (but the regression still fails, so let's assume that's a different problem).

Is there anything I can do to isolate the problem further? For future reference, is there any way to analyze or understand this kind of problem more quickly? For example, given my description of the lhs changing even though the rhs does not, what would you conclude?

EDIT:
The problem was entirely due to -ffast-math.
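
For future readers, a minimal sketch (not the project's code; values chosen for illustration) of why -ffast-math can cause exactly this symptom: it enables -funsafe-math-optimizations, which lets the compiler reassociate floating-point arithmetic, and FP addition is not associative, so a mathematically equivalent regrouping of the same sum rounds differently in the last bit:

/* sketch: FP addition is not associative, which is the freedom
 * -ffast-math hands the compiler; a different grouping of the same
 * sum differs in the last bit, just like c[3] above */
#include <stdio.h>

int main(void) {
    volatile double a = 0.1, b = 0.2, c = 0.3;  /* volatile: no folding */
    printf("%a\n", (a + b) + c);  /* 0x1.3333333333334p-1 */
    printf("%a\n", a + (b + c));  /* 0x1.3333333333333p-1 */
    return 0;
}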


Comments (1)

小嗲 2024-12-02 18:26:24


You can change the floating-point type your program uses. If you use float, you can switch to double; if c, f, df are double, you can switch to long double (80-bit on Intel; 128-bit on SPARC). With gcc 4.5.2 you can even try __float128, a 128-bit software-emulated type.

The rounding error will be smaller with a longer floating-point type.
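
As a concrete (hedged) illustration of that point, a self-contained snippet rather than the asker's code: an increment below double's half-ulp vanishes in double but survives in x86's 80-bit long double:

/* sketch: the same tiny increment is absorbed by double but kept
 * by x86's 80-bit long double (64-bit mantissa) */
#include <stdio.h>

int main(void) {
    double d = 1.0;
    long double ld = 1.0L;
    d  += 1e-17;   /* below half an ulp of 1.0 in double: lost, d == 1.0 */
    ld += 1e-17L;  /* above half an ulp in long double: retained */
    printf("%a\n%La\n", d, ld);
    return 0;
}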

Why does adding some code (even code that is never executed) change the result? GCC may compile the program differently if the code size changes. There are a lot of heuristics inside GCC, and some of them are based on function size, so GCC may compile your function in a different way.

Also, try compiling your project with -mfpmath=sse -msse2, because x87 (the default fpmath for older gcc) has known accuracy quirks: http://gcc.gnu.org/wiki/x87note

by default x87 arithmetic is not true 64/32 bit IEEE
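
A small sketch of the surprise that note warns about; whether the x87 branch actually fires depends on compiler, flags, and optimization level, so treat it as illustrative only:

/* sketch: under x87 (-mfpmath=387), q is rounded to 64 bits when
 * stored, while the recomputed a / b may sit in an 80-bit register,
 * so the comparison can fail; with -mfpmath=sse -msse2 every step
 * rounds to double and it holds */
#include <stdio.h>

int main(void) {
    volatile double a = 1.0, b = 3.0;  /* volatile: defeat constant folding */
    double q = a / b;
    printf(q == a / b ? "consistent rounding (e.g. SSE)\n"
                      : "x87 excess precision observed\n");
    return 0;
}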

PS: you should not use -ffast-math-like options when you are interested in stable numeric results: http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Optimize-Options.html

-ffast-math
Sets -fno-math-errno, -funsafe-math-optimizations,
-fno-trapping-math, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range.

This option causes the preprocessor macro __FAST_MATH__ to be defined.

This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.

This part of -ffast-math can change results:

-funsafe-math-optimizations
Allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards. When used at link-time, it may include libraries or startup files that change the default FPU control word or other similar optimizations.

This part hides traps and NaN-related errors from the user (sometimes the user wants to see every trap exactly, in order to debug the code):

-fno-trapping-math
Compile code assuming that floating-point operations cannot generate user-visible traps. These traps include division by zero, overflow, underflow, inexact result and invalid operation. This option implies -fno-signaling-nans. Setting this option may allow faster code if one relies on “non-stop” IEEE arithmetic, for example.
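
Relatedly, when you do want those traps while debugging, glibc offers the GNU extension feenableexcept to turn FP exceptions into SIGFPE; a sketch (Linux/glibc only, and pointless under -fno-trapping-math):

/* sketch: make the first invalid/overflow/divide-by-zero raise
 * SIGFPE so a debugger stops right at the bad FP operation
 * (GNU extension; build with gcc trap.c -lm) */
#define _GNU_SOURCE
#include <fenv.h>
#include <stdio.h>

int main(void) {
    feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);
    volatile double zero = 0.0;
    printf("%a\n", 1.0 / zero);  /* traps instead of quietly printing inf */
    return 0;
}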

This part of fast-math says that the compiler may assume the default rounding mode everywhere (which can be wrong for some programs):

-fno-rounding-math
Enable transformations and optimizations that assume default floating point rounding behavior. This is round-to-zero for all floating point to integer conversions, and round-to-nearest for all other arithmetic truncations. ... This option enables constant folding of floating point expressions at compile-time (which may be affected by rounding mode) and arithmetic transformations that are unsafe in the presence of sign-dependent rounding modes.
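
To see why that assumption can bite, a hedged sketch: gcc may fold 1.0 / 3.0 at compile time using round-to-nearest, so a run-time rounding-mode change is honored only for the division it cannot fold (gcc does not implement #pragma STDC FENV_ACCESS, so build with -frounding-math and link with -lm):

/* sketch: run-time rounding mode vs. compile-time constant folding;
 * compare builds with and without -frounding-math */
#include <fenv.h>
#include <stdio.h>

int main(void) {
    fesetround(FE_UPWARD);      /* request round-toward-+infinity */
    volatile double one = 1.0;  /* volatile: force a run-time divide */
    printf("%a\n", one / 3.0);  /* honors FE_UPWARD when not folded */
    printf("%a\n", 1.0 / 3.0);  /* may be pre-folded to round-to-nearest */
    return 0;
}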
