Numeric code ported from PowerPC to Intel gives different results using float

Posted 2024-08-19 14:49:19

My essential problem is how to make arithmetic with floats on x86 behave like a PowerPC, going from Classic MacOS (CodeWarrior) to Windows (VS 2008).

The code in question, of which there is a lot, has a pile of algorithms which are highly iterative and numerically very sensitive.

A typical complex line is:

Ims_sd = sqrt((4.0*Ams*sqr(nz)-8.0*(Ams+Dms)*nz+12.0*sqr(Ams)) /
         (4.0*sqr(Ams)*(sqr(nz)-1)) - 
         sqr(Ims_av))*sqrt(nz-1);

It is written using a typedef'd float as the base type.

Changing to double gets very similar results on both platforms but unfortunately the numbers are not acceptable so we can't take that easy way out.
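To compare the two base types side by side, the expression above could be wrapped in a template. This is only a sketch: `ims_sd` and the `Real` parameter are hypothetical names, and `sqr` is assumed to be a plain squaring helper. One deliberate difference from the original line: the constants are converted to `Real`, so the `float` instantiation stays in single precision, whereas literals such as `4.0` are `double` in C and C++ and would promote the surrounding arithmetic.

```cpp
#include <cmath>

// Assumed meaning of the sqr() helper used in the question.
template <typename Real>
Real sqr(Real x) { return x * x; }

// Hypothetical wrapper around the expression from the question.
// Real is the base type under test (float or double).
template <typename Real>
Real ims_sd(Real Ams, Real Dms, Real nz, Real Ims_av) {
    // Constants are held as Real so no silent promotion to double occurs.
    const Real one = 1, four = 4, eight = 8, twelve = 12;
    return std::sqrt((four * Ams * sqr(nz) - eight * (Ams + Dms) * nz + twelve * sqr(Ams)) /
                     (four * sqr(Ams) * (sqr(nz) - one)) -
                     sqr(Ims_av)) * std::sqrt(nz - one);
}
```

Calling `ims_sd<float>(...)` and `ims_sd<double>(...)` on the same inputs gives a quick measure of how much of a discrepancy comes from the base type alone.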

The Mac code is compiled using CodeWarrior, and just turning off the generation of the FMADD and FMSUB instructions had a drastic effect on the numbers produced. So my starting point was to search for the Visual Studio (2008) options that seemed most similar, making sure fused add was being used. We suspect that the key lies in how the compiler allocates intermediate storage in computations.
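Why a fused multiply-add changes results can be shown in a few lines. The sketch below is illustrative only and uses hypothetical function names: the unfused version rounds the product to `float` before adding, while `std::fma` rounds once after the exact multiply-add.

```cpp
#include <cmath>

// Two roundings: the product is rounded to float, then the sum is rounded.
float mul_then_add(float a, float b, float c) {
    volatile float product = a * b;  // volatile keeps the compiler from fusing the two steps
    return product + c;
}

// One rounding: a*b + c is evaluated exactly and rounded once.
float fused_mul_add(float a, float b, float c) {
    return std::fma(a, b, c);
}
```

With `a = b = 1 + 2^-12` and `c = -(1 + 2^-11)`, the exact product is `1 + 2^-11 + 2^-24`. The unfused version rounds away the `2^-24` term and returns 0, whereas the fused version returns `2^-24`. In a highly iterative algorithm, differences of this kind can compound.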

Currently the best results are being obtained with a combination of enabling SSE2 and /fp:fast. Enabling intrinsic functions causes values to drift further from the Mac values.

The /fp switch documentation says that only /fp:strict turns off the fused add behaviour.

MSDN talks about linking FP10.OBJ "before LIBC.LIB, LIBCMT.LIB, or MSVCRT.LIB" to guarantee 64-bit precision. I've apparently achieved this by specifying FP10.OBJ in the linker input field (verbose linker output shows it prior to MSVCRTD.lib).

I've also set 64 bit precision by invoking

_controlfp_s(&control_word, _PC_64, MCW_PC);

in DllMain.

Note that the problem is not due to differences in floating point exception handling between platforms, nor is it due to the (delightful) way that PowerPC allows integer division by zero (just returning zero), as these areas have already been audited and addressed, thanks hugely to PC-Lint. The program runs and produces somewhat plausible output, just not quite good enough.

UPDATE:

An interesting comment from a friend:
One possibility is that the PPC has a large number of temporary registers that can store 64 bit intermediate values whereas the x86 code may have to unload and reload the FPU (truncating to 4 bytes and losing precision).

This may be why SSE2 works better as (IIRC) it has more registers and more scope for preserving intermediate values.

One possibility - can your code be compiled as 64 bit? The x64 mode also has more registers for intermediates, and better FP instructions so it may be closer to the PPC in design and execution.

Initial testing with a 64-bit build actually got closer, as he suggested it might (I first thought it overshot but that was due to an incorrect modeling setting).
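The friend's point about intermediates being truncated to 4 bytes can be illustrated with a small sketch (function names are hypothetical). The first version forces the intermediate sum into a `float`, as a spill to memory would; the second keeps it in a wider type until the end.

```cpp
// Intermediate sum is rounded to a 4-byte float before the subtraction.
float sum_then_subtract_narrow(float a, float b) {
    volatile float t = a + b;  // volatile forces the value through float storage
    return t - a;
}

// Intermediate sum is kept in double; only the final result is narrowed.
float sum_then_subtract_wide(float a, float b) {
    double t = static_cast<double>(a) + b;
    return static_cast<float>(t - a);
}
```

For `a = 1e8f` and `b = 1.0f`, the narrow version returns 0 because `1e8 + 1` is not representable as a `float`, while the wide version returns 1. Whether a given compiler keeps intermediates wide or narrow depends on its code generation and options, which is consistent with the sensitivity to SSE2 and 64-bit builds described above.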

Final Resolution

I'm sure anyone interested in this topic is sufficiently obsessive they would like to know how it all worked out in the end. The software is finished and delivering consistent numeric results. We were never able to get all the algorithms to deliver identical results to the Mac but they were close enough to be statistically acceptable. Given that the processing is guided by an expert user selecting the areas of interest and that user input is partly reactive to how the model progresses, the chief scientist deemed it acceptable (this was not an overnight decision!). The remaining numeric differences are well within the bounds of what determines different clinical results so no different diagnoses have been seen with testing.

Comments (3)

沦落红尘 2024-08-26 14:49:19

The whole question of floating point determinism across multiple platforms seems to be a very thorny issue and the more you dig into it, the worse it seems to get.

I did find this interesting article that discusses the problem in great depth - it might be able to throw up some ideas.

嘦怹 2024-08-26 14:49:19

I refer you to GCC bug 323:

I'd like to welcome the newest members of the bug 323 community, where all x87 floating point errors in gcc come to die! All floating point errors that use the x87 are welcome, despite the fact that many of them are easily fixable, and many are not! We're all one happy family, making the egregious mistake of wanting accuracy out of the most accurate general purpose FPU on the market!

The short summary is that it's incredibly tedious to get "true" IEEE floating-point singles/doubles on an x87 without significant performance penalty; you suffer from double-rounding of denorms even if you use fldcw due to the reduced exponent range (IIRC, IEEE FP specifically allows implementations to do their own thing WRT denorms). Presumably you could do something like this:

  1. Round to positive infinity, perform the operation (getting ldresult1), round to nearest even, convert to float (getting fresult1).
  2. RTNI, perform the op, RTNE, convert to float.
  3. If they're the same, great: You have the correct RTNE float result. If not, then (I think) fresult2 < fresult1, and furthermore, fresult1=nextafterf(fresult2,+inf), and there are two possibilities:
    • ldresult1 == ((long double)fresult1+fresult2)/2. The "correct" answer is fresult2.
    • ldresult2 == ((long double)fresult1+fresult2)/2. The "correct" answer is fresult1.

I'm probably wrong in the details somewhere, but this is presumably the pain you have to go through when you get a denorm.
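The first three steps above can be sketched roughly as follows. This is a simplified illustration, not the full procedure: it uses `double` as the wide type instead of the x87 `long double`, handles only division, and stops at detecting whether the two narrowed results agree (the tie-break in the two bullets is not implemented). All names are hypothetical.

```cpp
#include <cfenv>
#include <cmath>

struct BracketedQuotient {
    double toward_pos_inf;  // wide quotient computed while rounding toward +infinity
    double toward_neg_inf;  // wide quotient computed while rounding toward -infinity
    float narrowed_hi;      // toward_pos_inf converted to float, round-to-nearest
    float narrowed_lo;      // toward_neg_inf converted to float, round-to-nearest
    bool agree;             // true when both narrowed results are the same float
};

BracketedQuotient divide_bracketed(float x, float y) {
    // volatile keeps the compiler from folding or moving the divisions
    // across the rounding-mode changes.
    volatile double a = x, b = y;
    const int saved_mode = std::fegetround();

    std::fesetround(FE_UPWARD);
    volatile double hi = a / b;
    std::fesetround(FE_DOWNWARD);
    volatile double lo = a / b;

    std::fesetround(FE_TONEAREST);  // narrowing is done in round-to-nearest-even
    volatile float fhi = static_cast<float>(hi);
    volatile float flo = static_cast<float>(lo);
    std::fesetround(saved_mode);

    return BracketedQuotient{hi, lo, fhi, flo, fhi == flo};
}
```

When `agree` is true, the two directed-rounding results narrowed to the same `float`, which corresponds to the "great" case in step 3. Doing this for every operation shows why the approach is described as tedious and costly.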

And then you hit the other issue: I'm pretty sure there's no guarantee about sqrt() returning the same result across different implementations (and very sure for trig functions); the only guarantee I've ever seen is that the result is "within 1 ulp" (presumably of the correctly rounded result). It's highly dependent on the algorithm used, and modern CPUs have instructions for these, so you suffer a significant performance penalty if you try to implement it in software. Nevertheless, ISTR a "portable" floating point library somewhere which was supposed to achieve consistency, but I don't remember the name OTTOMH.

慈悲佛祖 2024-08-26 14:49:19

Not an answer as such, but more text (and formatting) than I could fit in a comment. Reading your question, it strikes me that you have probably considered all of this, but not told us, so this may all be irrelevant chatter. If it is, I apologise.

Can you (did you?) enforce adherence to IEEE754 rules for floating-point arithmetic on either the original or ported versions of the program? My first guess is that the two platforms (combination of hardware, o/s, libraries) implement different approaches to fp arithmetic.

What assumptions (if any) have you made about the default sizes, on the two platforms, of some of the fundamental types such as ints and floats? The C standard (and I believe the C++ standard) allows platform dependency for some of these (I can't remember which off the top of my head; I'm really a Fortran programmer).
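One way to make such assumptions explicit is to query them from the standard headers on each platform and compare. The sketch below uses a hypothetical `FpTraits` struct; `FLT_EVAL_METHOD` comes from `<cfloat>` and reports how the compiler evaluates intermediates (0 means in the operand type, 2 means in `long double`, negative means indeterminate).

```cpp
#include <cfloat>
#include <cstddef>
#include <limits>

// Hypothetical summary of the float-related properties worth comparing.
struct FpTraits {
    std::size_t float_bytes;   // storage size of float
    int float_mantissa_bits;   // significand precision, including the implicit bit
    bool float_is_iec559;      // true if float claims IEEE 754 conformance
    int eval_method;           // value of FLT_EVAL_METHOD for this build
};

FpTraits query_fp_traits() {
    return FpTraits{sizeof(float),
                    std::numeric_limits<float>::digits,
                    std::numeric_limits<float>::is_iec559,
                    static_cast<int>(FLT_EVAL_METHOD)};
}
```

Matching sizes and `is_iec559` on both sides would not by themselves guarantee identical results, since the evaluation method and instruction selection can still differ.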

Final guess: I've grown used (in my Fortranny world) to specifying float constants such as your 4.0 with sufficient digits to specify all the (decimal) digits in the preferred representation, i.e. something like 4.000000000000000000000000. I know that, in Fortran, a 4-byte float constant such as 3.14159265 will, when automatically cast to 8 bytes, not fill the extra bytes with the further digits in the decimal expansion of pi. This may be affecting you.
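The same effect exists in C and C++, and can be sketched as follows (function names are hypothetical). A constant that is exactly representable, such as 4.0, is unaffected; one that is not, such as 0.1, keeps only single-precision accuracy when it starts life as a `float`. The `static_assert` also records that an unsuffixed literal like `4.0` is a `double`, so multiplying it by a `float` is carried out in `double`.

```cpp
#include <type_traits>

// An unsuffixed literal is double, so mixed arithmetic is promoted to double.
static_assert(std::is_same<decltype(4.0 * 1.0f), double>::value,
              "4.0 * float is evaluated as double");

// A float literal widened to double keeps only float accuracy.
double widened_float_literal() { return 0.1f; }

// A double literal carries full double accuracy.
double double_literal() { return 0.1; }
```

The two functions differ by roughly 1.5e-9, which is the single-precision representation error of 0.1.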

None of this really helps you ensure that the ported version of your code produces the same, to the bit, results as the original version, only identify sources of difference.

Finally, is your requirement that the new version produce the same results as the old version, or that you provide assurance to your customers that the new version produces accurate answers? Your question leaves open the possibility that the old version of the program was 'wronger' than the new, taking into account all the sources of error in numerical computations.
