What does gcc's ffast-math actually do?
I understand gcc's -ffast-math flag can greatly increase speed for float ops, and goes outside of IEEE standards, but I can't seem to find information on what is really happening when it's on. Can anyone please explain some of the details and maybe give a clear example of how something would change if the flag was on or off?
I did try digging through S.O. for similar questions but couldn't find anything explaining the workings of ffast-math.
3 Answers
-ffast-math does a lot more than just break strict IEEE compliance.

See https://gcc.gnu.org/wiki/FloatingPointMath for details on the various more-specific options and on GCC FP behaviour in general. Note that -fno-rounding-math is the default, so GCC assumes the rounding mode is the IEEE default of nearest with even as a tie-break, allowing compile-time constant folding.

First of all, of course, it does break strict IEEE compliance, allowing e.g. the reordering of instructions to something which is mathematically the same (ideally) but not exactly the same in floating point.
Second, it disables setting errno after single-instruction math functions, which means avoiding a write to a thread-local variable (this can make a 100% difference for those functions on some architectures). -fno-math-errno is fully safe in programs that don't read errno after math calls, and also allows better inlining of functions like lrint. (For example on x86 with SSE4: Godbolt.) Setting errno from math.h functions is optional in ISO C, so this part of fast-math is still standards-compliant.
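For instance (a sketch; the exact code GCC emits depends on version, target, and flags):

```c
#include <math.h>

/* With the default -fmath-errno, GCC typically keeps a slow path (or an
 * outright library call) so errno can be set when the argument is invalid.
 * With -fno-math-errno it can emit just the sqrt instruction (e.g. sqrtsd
 * on x86-64), with no call and no store to errno. */
double hypotenuse(double x, double y) {
    return sqrt(x * x + y * y);   /* the argument is never negative here anyway */
}
```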
Third, it makes the assumption that all math is finite, which means that no checks for NaN (or zero) are made in places where they would have detrimental effects. It is simply assumed that this isn't going to happen. (-ffinite-math-only)
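One practical consequence (a hedged example; whether a given check is actually removed varies with GCC version and optimization level):

```c
#include <math.h>

/* Under -ffinite-math-only the compiler may assume values are never NaN
 * or infinite, so explicit guards like this one can be optimized away,
 * silently dropping the fallback branch. */
double safe_half(double d) {
    if (isnan(d))        /* may be folded to "false" under -ffast-math */
        return 0.0;
    return d * 0.5;
}
```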
Fourth, it enables reciprocal approximations for division and reciprocal square root. (-funsafe-math-optimizations enables that and other things.)
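To make that concrete, here is a sketch (not GCC's literal output) of the kind of substitution this allows on x86 with SSE: a hardware reciprocal estimate refined by one Newton-Raphson step instead of a real divide.

```c
#include <xmmintrin.h>   /* SSE intrinsics; x86-specific illustration */

/* Roughly what "x / y" may become when GCC decides an approximate
 * reciprocal plus refinement is faster than an actual division.
 * The result is close to, but not exactly, the correctly rounded quotient. */
static inline float approx_div(float x, float y) {
    float r = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(y)));  /* ~12-bit estimate of 1/y */
    r = r * (2.0f - y * r);                              /* one Newton-Raphson step */
    return x * r;                                        /* x * (1/y) instead of x / y */
}
```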
Further, it disables signed zero (code assumes signed zero does not exist, even if the target supports it) and rounding math, which enables among other things constant folding at compile-time. (-fno-signed-zeros). For example, this allows optimizing x + 0.0 to x. Without that option, only x - 0.0 and x * 1.0 can be optimized to x.
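The signed-zero case is easy to see in a couple of lines (illustrative snippet):

```c
#include <stdio.h>

int main(void) {
    double x = -0.0;
    /* Under IEEE 754, (-0.0) + 0.0 is +0.0, so "x + 0.0" is not always the
     * same value as x.  With -fno-signed-zeros GCC may treat the two as
     * interchangeable and fold x + 0.0 to x anyway. */
    printf("%g %g\n", x, x + 0.0);   /* prints "-0 0" without fast-math */
    return 0;
}
```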
Last, it generates code that assumes that no hardware interrupts can happen due to signalling/trapping math (that is, if these cannot be disabled on the target architecture and consequently do happen, they will not be handled). (-fno-trapping-math -fno-signaling-nans)
The other effect of -fno-trapping-math is that setting fenv flags or not (when exceptions are masked) isn't considered an observable side-effect. (By default, all FP exceptions are masked, regardless of fast-math or not, so for example sqrt(-1) gives a NaN instead of raising SIGFPE.) GCC's default is -ftrapping-math, but it doesn't work perfectly, sometimes allowing optimizations that change the number of possible FP exceptions from 0 to non-zero or vice-versa (if that's something it was trying to preserve in the first place?). And worse, sometimes blocking safe optimizations. For code that doesn't use fenv stuff like feclearexcept() and fetestexcept(), -fno-trapping-math is safe (on normal ISAs at least) and can enable significant optimizations. See Why gcc is so much worse at std::vector<float> vectorization of a conditional multiply than clang? for example.
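Here is the kind of fenv usage that -fno-trapping-math stops treating as observable (a sketch; actual behaviour under optimization varies):

```c
#include <fenv.h>
#include <math.h>
#include <stdio.h>

/* Exceptions are masked, so sqrt(-1.0) just returns NaN and quietly sets
 * the "invalid" flag.  Code like this treats that flag as an observable
 * side effect; -fno-trapping-math tells GCC it need not be preserved, so
 * with optimization (or constant folding) the flag may never be raised. */
int main(void) {
    feclearexcept(FE_ALL_EXCEPT);
    double r = sqrt(-1.0);                      /* NaN, no SIGFPE */
    printf("result = %f, FE_INVALID raised: %d\n",
           r, fetestexcept(FE_INVALID) != 0);
    return 0;
}
```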
When -ffast-math is used while linking, GCC will link with CRT startup code that sets FPU flags differently. For example on x86, it sets the SSE mxcsr FTZ and DAZ control bits, to flush subnormals to 0 instead of doing gradual underflow (which takes a microcode assist on many CPUs). (FTZ = Flush To Zero for subnormal results, DAZ = Denormals Are Zero for subnormal inputs to instructions, including compares.)
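If you want that flush-to-zero behaviour without the rest of -ffast-math, you can set the same MXCSR bits yourself; a minimal x86 sketch using the standard SSE intrinsics:

```c
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE (FTZ)           */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE (DAZ, SSE3) */

int main(void) {
    /* The same MXCSR bits the fast-math CRT startup code sets: subnormal
     * results are flushed to zero and subnormal inputs are treated as zero,
     * trading gradual underflow for speed. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    /* ... run the FP-heavy code here ... */
    return 0;
}
```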
Most code can use -O3 -fno-math-errno -fno-trapping-math. Unlike other parts of -ffast-math, they never affect numerical results, only whether other side-effects are considered significant for the optimizer to try to preserve. (-fno-signaling-nans is already the default and doesn't need to be specified.)

As you mentioned, it allows optimizations that do not preserve strict IEEE compliance.
An example is reassociating or factoring an expression into a mathematically equivalent but differently grouped form; a sketch of such a rewrite follows.
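A hypothetical rewrite of that kind (the expressions here are only illustrative):

```c
/* Under -funsafe-math-optimizations / -fassociative-math (both implied by
 * -ffast-math), GCC may treat FP arithmetic as associative and distributive,
 * even though rounding means it is not. */
double before(double a, double b, double c) {
    return a * b + a * c;    /* evaluated exactly as written under strict FP rules */
}

double after(double a, double b, double c) {
    return a * (b + c);      /* mathematically equal, but can round differently */
}
```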
Because floating-point arithmetic is not associative, the ordering and factoring of the operations will affect results due to round-off. Therefore, this optimization is not done under strict FP behavior.
I haven't actually checked to see if GCC actually does this particular optimization. But the idea is the same.
The primary issue with -ffast-math is that of reproducibility, because many problems do not require full precision. Computing a sum of several terms using floating-point arithmetic has no "right" order of evaluation: it can be done, for example, in a linear chain or in a binary tree. The latter does happen to have a lower expected error, but they both have the same worst-case error, though neither is unequivocally "right". However, expressing this in C++ or C as a single expression has a prescribed order of evaluation (see the sketch below). And when running cross-platform validation tests, reproducibility is paramount. With -ffast-math, the order of evaluation can be changed, violating nothing in the IEEE standards, but violating C++ rules, and wreaking havoc with reproducibility.
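A small sketch of the linear-chain versus tree-shaped evaluation described above (illustrative; whether GCC actually reassociates a given expression depends on the flags and version):

```c
#include <stdio.h>

/* Strict C/C++ semantics: a left-to-right chain of additions. */
static float sum_chain(const float *v, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += v[i];                 /* ((v[0] + v[1]) + v[2]) + ... */
    return s;
}

/* Pairwise (tree-shaped) order: identical in real arithmetic, usually a
 * different rounded result in float.  -ffast-math lets the compiler switch
 * between such orders, e.g. to vectorize a reduction. */
static float sum_pairwise(const float *v, int n) {
    if (n == 1) return v[0];
    int half = n / 2;
    return sum_pairwise(v, half) + sum_pairwise(v + half, n - half);
}

int main(void) {
    const float v[] = {1e8f, 1.0f, -1e8f, 1.0f, 1e8f, 1.0f, -1e8f, 1.0f};
    /* With round-to-nearest these print different sums (1.0 vs 0.0). */
    printf("chain    = %.1f\n", sum_chain(v, 8));
    printf("pairwise = %.1f\n", sum_pairwise(v, 8));
    return 0;
}
```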
But I'm wondering whether some of these -ffast-math hacks really do produce faster computations in typical deployments. For example, suppose we have well-bounded numbers, say in [-1,+1]: would -ffinite-math-only and -fno-denorms actually make any difference in performance? It is my understanding that these only slow things down if they are actually encountered and cause a trap.