My Ph.D. student and I have encountered a problem in a physics data analysis context that I could use some insight on. We have code that analyzes data from one of the LHC experiments and that gives irreproducible results. In particular, the results of calculations obtained from the same binary, run on the same machine, can differ between successive executions. We are aware of the many different sources of irreproducibility, but have excluded the usual suspects.
We have tracked the problem down to irreproducibility of (double precision) floating point comparison operations when comparing two numbers that nominally have the same value. This can happen occasionally as a result of prior steps in the analysis. We just found an example that tests whether a number is less than 0.3 (note that we NEVER test for equality between floating values). It turns out that due to the geometry of the detector, it was possible for the calculation to occasionally produce a result which would be exactly 0.3 (or its closest double precision representation).
We are well aware of the pitfalls in comparing floating point numbers, and also of the potential for excess precision in the FPU to affect the results of the comparison. The question I would like to have answered is "why are the results irreproducible?" Is it because the FPU register load or other FPU instructions are not clearing the excess bits, and thus "leftover" bits from previous calculations are affecting the results? (This seems unlikely.) I saw a suggestion on another forum that context switches between processes or threads could also induce a change in floating point comparison results, due to the contents of the FPU being stored on the stack and thus being truncated. Any comments on these or other possible explanations would be appreciated.
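To make the worry concrete, here is a minimal sketch of the mechanism (illustrative only, not our analysis code; it assumes x87 code generation, e.g. gcc -m32 on Linux without -ffloat-store or -mfpmath=sse):

    #include <stdio.h>

    int main(void)
    {
        volatile double num = 1.0, den = 3.0;   /* volatile defeats constant folding */
        double threshold = 1.0 / 3.0;           /* folded to a 53-bit double at compile time */
        double x = num / den;                   /* on x87 this may still carry a 64-bit mantissa */

        /* Nominally x and threshold are the same number.  Whether this prints 1 or 0
           depends on whether the compiler happens to spill x to a 64-bit memory slot
           before the comparison -- something register pressure or recompilation can change. */
        printf("x > threshold: %d\n", x > threshold);
        return 0;
    }

With SSE2 code generation (or with -ffloat-store) both sides are rounded to 53 bits and always compare equal.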
Comments (7)
At a guess, what's happening is that your computations are normally being carried out to a few extra bits of precision inside the FPU, and only rounded at specific points (e.g., when you assign a result to a variable).
When there's a context switch, however, the state of the FPU has to be saved and restored -- and there's at least a pretty fair chance that those extra bits are not being saved and restored in the context switch. When that happens, it probably wouldn't cause a major change, but if (for example) you later subtract off a fixed amount from each and multiply what's left, the difference would be multiplied as well.
To be clear: I doubt that "left over" bits would be the culprit. Rather, it would be loss of extra bits causing rounding at slightly different points in the computation.
What platform?
Most FPUs can internally store more accuracy than the IEEE double representation, to avoid rounding error in intermediate results. There is often a compiler switch to trade off speed against accuracy - see http://msdn.microsoft.com/en-us/library/e7s85ffb(VS.80).aspx
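As a quick check of how the FPU is configured, a sketch like the following (assuming an x86 target and GCC-style inline assembly) reads the x87 precision-control field; Windows CRTs typically set it to 53 bits, while Linux leaves the default of 64 bits:

    #include <stdio.h>

    int main(void)
    {
        unsigned short cw;
        __asm__ __volatile__("fnstcw %0" : "=m"(cw));   /* store the x87 control word */
        switch ((cw >> 8) & 0x3) {                      /* bits 8-9: precision control */
        case 0:  puts("24-bit (single) precision");   break;
        case 2:  puts("53-bit (double) precision");   break;
        case 3:  puts("64-bit (extended) precision"); break;
        default: puts("reserved setting");            break;
        }
        return 0;
    }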
Is the program multi-threaded?
If yes, I would suspect a race condition.
If not, program execution is deterministic. The most probable reason for getting different results given the same inputs is undefined behaviour, i.e., a bug in your program: reading an uninitialized variable, a stale pointer, overwriting the lowest bits of some FP number on the stack, etc. The possibilities are endless. If you're running this on Linux, try running it under valgrind and see if it uncovers some bugs.
BTW, how did you narrow down the problem to FP comparison?
(Long shot: failing hardware? E.g., failing RAM chip might cause data to be read differently on different occasions. Though, that'd probably crash the OS rather quickly.)
Any other explanation is implausible -- bugs in the OS or the HW would not have gone undiscovered for long.
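As an illustration of that class of bug (a made-up example, not taken from the poster's code): an uninitialized double picks up whatever happens to be on the stack, so a threshold cut on it can flip from run to run, and valgrind's memcheck flags exactly this kind of use of uninitialised values.

    #include <stdio.h>

    static double smeared(int apply_correction)
    {
        double correction;                 /* only assigned on one branch...          */
        if (apply_correction)
            correction = 0.01;
        return 0.29 + correction;          /* ...but always read: undefined behaviour */
    }

    int main(void)
    {
        /* The result depends on stale stack contents and can differ between runs. */
        printf("passes cut: %d\n", smeared(0) < 0.3);
        return 0;
    }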
I made this:
compiled it with Intel C using /Qlong_double, so that it produced this:
and started 10 instances with different "seeds". As you can see, it compares the 10-byte long doubles from memory with one on the FPU stack, so in the case where the OS doesn't preserve full precision, we'd surely see an error.
And well, they're still running without detecting anything... which is not really surprising, because x86 has instructions to save/restore the whole FPU state at once, and anyway an OS that didn't preserve full precision would be completely broken.
So either it's some unique OS/CPU/compiler combination, or the differing comparison results are produced after changing something in the program and recompiling it, or it's a bug in the program, e.g. a buffer overrun.
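The test program and its disassembly were not included in the post, but a hypothetical reconstruction of the idea (assuming long double maps to the 80-bit x87 format, as with GCC or Intel C with /Qlong_double) could look like this:

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        long double seed = (argc > 1) ? strtold(argv[1], NULL) : 1.0L;  /* per-instance "seed" */
        volatile long double in_memory;        /* forces a 10-byte store to memory */
        unsigned long long iterations = 0;

        for (;;) {
            long double on_stack = seed / 3.0L;   /* non-terminating fraction, full 64-bit mantissa */
            in_memory = on_stack;                 /* the memory copy keeps the full 80-bit value */
            if (in_memory != on_stack) {          /* memory copy vs. copy left on the FPU stack */
                printf("mismatch after %llu iterations\n", iterations);
                return 1;
            }
            ++iterations;
        }
    }

If the OS truncated x87 registers to 64 bits when saving state across a context switch, the copy left on the register stack would eventually differ from the one in memory and the loop would report a mismatch.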
The CPU's internal FPU can store floating points at higher accuracy than double or float. These values have to be converted whenever the values in the register have to be stored somewhere else, including when memory is swapped out into cache (this I know for a fact) and a context switch or OS interrupt on that core sounds like another easy source. Of course, the timing of OS interrupts or context switches or the swapping of non-hot memory is completely unpredictable and uncontrollable by the application.
Of course, this depends on platform, but your description sounds like you run on a modern desktop or server (so x86).
I'll just merge some of the comments from David Rodriguez and Bo Persson and make a wild guess.
Could it be task switching while using SSE3 instructions? Based on this Intel article on using SSE3 instructions, the commands to preserve register state, FSAVE and FRSTOR, have been replaced by FXSAVE and FXRSTOR, which should handle the full length of the accumulator.
On an x64 machine, I suppose that the "incorrect" instruction could be contained in some external compiled library.
You are certainly hitting GCC Bug n°323, which, as others point out, is due to the excess precision of the FPU.
Solution: the -ffloat-store compile switch (see the GCC docs).
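A minimal sketch in the spirit of the bug 323 test cases (not the exact example from the bug report): compiled for x87, e.g. with gcc -m32, the message below may be printed because y2 can be held in an 80-bit register while y was rounded to a 64-bit double when it was stored; building with -ffloat-store (or -mfpmath=sse -msse2) makes it go away.

    #include <stdio.h>

    void compare(double x, double y)
    {
        double y2 = x + 1.0;               /* may be kept in an 80-bit x87 register */
        if (y != y2)
            printf("surprise: y != y2\n");
    }

    int main(void)
    {
        double x = 0.012;
        double y = x + 1.0;                /* rounded to a 64-bit double when stored */
        compare(x, y);
        return 0;
    }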