Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?
I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call pow(a,2) by compiling it into a*a, but the call pow(a,6) is not optimized and will actually call the library function pow, which greatly slows down the performance. (In contrast, the Intel C++ Compiler, executable icc, will eliminate the library call for pow(a,6).)

What I am curious about is that when I replaced pow(a,6) with a*a*a*a*a*a using GCC 4.5.1 and the options "-O3 -lm -funroll-loops -msse4", it uses 5 mulsd instructions:
movapd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
while if I write (a*a*a)*(a*a*a), it will produce
movapd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm13, %xmm13
which reduces the number of multiply instructions to 3. icc has similar behavior.

Why do compilers not recognize this optimization trick?
Because floating-point math is not associative. The way you group the operands in floating-point multiplication has an effect on the numerical accuracy of the answer.

As a result, most compilers are very conservative about reordering floating-point calculations unless they can be sure that the answer will stay the same, or unless you tell them you don't care about numerical accuracy. For example: the -fassociative-math option of gcc, which allows gcc to reassociate floating-point operations, or even the -ffast-math option, which allows even more aggressive tradeoffs of accuracy against speed.
Lambdageek correctly points out that because associativity does not hold for floating-point numbers, the "optimization" of a*a*a*a*a*a to (a*a*a)*(a*a*a) may change the value. This is why it is disallowed by C99 (unless specifically allowed by the user, via a compiler flag or pragma). Generally, the assumption is that the programmer wrote what she did for a reason, and the compiler should respect that. If you want (a*a*a)*(a*a*a), write that.

That can be a pain to write, though; why can't the compiler just do [what you consider to be] the right thing when you use pow(a,6)? Because it would be the wrong thing to do. On a platform with a good math library, pow(a,6) is significantly more accurate than either a*a*a*a*a*a or (a*a*a)*(a*a*a). Just to provide some data, I ran a small experiment on my Mac Pro, measuring the worst error in evaluating a^6 for all single-precision floating-point numbers between [1,2): using pow instead of a multiplication tree reduces the error bound by a factor of 4. Compilers should not (and generally do not) make "optimizations" that increase error unless licensed to do so by the user (e.g. via -ffast-math).

Note that GCC provides __builtin_powi(x,n) as an alternative to pow(), which should generate an inline multiplication tree. Use that if you want to trade off accuracy for performance, but do not want to enable fast-math.

Another similar case: most compilers won't optimize a + b + c + d to (a + b) + (c + d) (this is an optimization, since the second expression can be pipelined better) and will evaluate the expression as given (i.e. as (((a + b) + c) + d)). This too is because of corner cases; the example program outputs

1.000000e-05 0.000000e+00
Fortran (designed for scientific computing) has a built-in power operator, and as far as I know Fortran compilers will commonly optimize raising to integer powers in a similar fashion to what you describe. C/C++ unfortunately don't have a power operator, only the library function pow(). This doesn't prevent smart compilers from treating pow specially and computing it in a faster way for special cases, but it seems they do it less commonly.

Some years ago I was trying to make it more convenient to calculate integer powers in an optimal way, and came up with the following. It's C++, not C, though, and it still depends on the compiler being somewhat smart about how to optimize/inline things. Anyway, I hope you might find it useful in practice:
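The template itself was lost in the page conversion; a sketch in the same spirit (compile-time exponentiation by squaring via template recursion; my reconstruction, not the answerer's exact code) is:

```cpp
// power<N>(x): computes x^N with O(log N) multiplications,
// with the recursion resolved entirely at compile time.
template <unsigned N, typename T>
struct power_impl {
    static T calc(const T& x) {
        // Square-and-multiply: even exponents square a half power,
        // odd exponents peel off one factor of x.
        if (N % 2 == 0)
            return power_impl<N / 2, T>::calc(x * x);
        return x * power_impl<N - 1, T>::calc(x);
    }
};

// Base case: x^0 == 1.
template <typename T>
struct power_impl<0, T> {
    static T calc(const T&) { return T(1); }
};

template <unsigned N, typename T>
inline T power(const T& x) {
    return power_impl<N, T>::calc(x);
}
```

After inlining, power<6>(a) collapses to a^2 * (a^2 * a^2), i.e. three multiplications, the same count as the hand-grouped (a*a*a)*(a*a*a).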
Clarification for the curious: this does not find the optimal way to compute powers, but since finding the optimal solution is an NP-complete problem, and this is only worth doing for small powers anyway (as opposed to using pow), there's no reason to fuss over the details.

Then just use it as power<6>(a).

This makes it easy to type powers (no need to spell out six a's with parens), and lets you have this kind of optimization without -ffast-math, in case you have something precision-dependent such as compensated summation (an example where the order of operations is essential). You can probably also forget that this is C++ and just use it in a C program (if it compiles with a C++ compiler).

Hope this can be useful.

EDIT:

This is what I get from my compiler:

For a*a*a*a*a*a,

For (a*a*a)*(a*a*a),

For power<6>(a)
GCC does actually optimize a*a*a*a*a*a to (a*a*a)*(a*a*a) when a is an integer. I tried with this command:

There are a lot of gcc flags, but nothing fancy. They mean: read from stdin; use the O2 optimization level; output an assembly-language listing instead of a binary; the listing should use Intel assembly-language syntax; the input is in the C language (usually the language is inferred from the input file extension, but there is no file extension when reading from stdin); and write to stdout.

Here's the important part of the output. I've annotated it with some comments indicating what's going on in the assembly language:

I'm using the system GCC on Linux Mint 16 Petra, an Ubuntu derivative. Here's the gcc version:

As other posters have noted, this optimization is not possible in floating point, because floating-point arithmetic is not associative.
Because a 32-bit floating-point number, such as 1.024, is not 1.024. In a computer, 1.024 is an interval: from (1.024-e) to (1.024+e), where "e" represents an error. Some people fail to realize this and also believe that the * in a*a stands for multiplication of arbitrary-precision numbers, without any errors attached to those numbers. The reason some people fail to realize this is perhaps the math they exercised in elementary school: working only with ideal numbers without errors attached, and believing that it is OK to simply ignore "e" when performing multiplication. They do not see the "e" implicit in "float a=1.2", "a*a*a" and similar C code.

Should the majority of programmers recognize (and be able to act on) the idea that the C expression a*a*a*a*a*a is not actually working with ideal numbers, the GCC compiler would then be FREE to optimize "a*a*a*a*a*a" into, say, "t=(a*a); t*t*t", which requires a smaller number of multiplications. But unfortunately, the GCC compiler does not know whether the programmer writing the code thinks of "a" as a number with or without an error. And so GCC will only do what the source code says, because that is what GCC sees with its "naked eye".

... once you know what kind of programmer you are, you can use the "-ffast-math" switch to tell GCC: "Hey, GCC, I know what I am doing!" This allows GCC to convert a*a*a*a*a*a into a different piece of text that looks different from a*a*a*a*a*a, but still computes a number within the error interval of a*a*a*a*a*a. This is OK, since you already know you are working with intervals, not ideal numbers.
No posters have mentioned the contraction of floating expressions yet (ISO C standard, 6.5p8 and 7.12.2). If the FP_CONTRACT pragma is set to ON, the compiler is allowed to regard an expression such as a*a*a*a*a*a as a single operation, as if evaluated exactly with a single rounding. For instance, a compiler may replace it by an internal power function that is both faster and more accurate. This is particularly interesting as the behavior is partly controlled by the programmer directly in the source code, while compiler options provided by the end user may sometimes be used incorrectly.

The default state of the FP_CONTRACT pragma is implementation-defined, so a compiler is allowed to do such optimizations by default. Thus portable code that needs to strictly follow the IEEE 754 rules should explicitly set it to OFF.

If a compiler doesn't support this pragma, it must be conservative and avoid any such optimization, in case the developer has chosen to set it to OFF.

GCC doesn't support this pragma, but with the default options it assumes it to be ON; thus for targets with a hardware FMA, if one wants to prevent the transformation of a*b+c to fma(a,b,c), one needs to provide an option such as -ffp-contract=off (to explicitly set the pragma to OFF) or -std=c99 (to tell GCC to conform to some C standard version, here C99, and thus follow the above paragraph). In the past, the latter option did not prevent the transformation, meaning that GCC was not conforming on this point: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37845
Library functions like "pow" are usually carefully crafted to yield the minimum possible error (in the generic case). This is usually achieved by approximating the function with splines (according to Pascal's comment, the most common implementation seems to use the Remez algorithm).

Fundamentally, the operation pow(a,6) has an inherent error of approximately the same magnitude as the error in any single multiplication or division, while the operation a*a*a*a*a*a has an inherent error that is more than 5 times the error of a single multiplication or division (because you are combining 5 multiplications).

The compiler should be really careful about the kind of optimization it is doing:

- optimizing pow(a,6) to a*a*a*a*a*a may improve performance, but drastically reduce the accuracy for floating-point numbers;
- optimizing a*a*a*a*a*a to pow(a,6) may actually reduce the accuracy, because "a" may be some special value that allows multiplication without error (a power of 2 or some small integer number);
- optimizing pow(a,6) to (a*a*a)*(a*a*a) or (a*a)*(a*a)*(a*a) can still lose accuracy compared to the pow function.

In general you know that for arbitrary floating-point values "pow" has better accuracy than any function you could eventually write, but in some special cases multiple multiplications may have better accuracy and performance. It is up to the developer to choose what is more appropriate, and eventually to comment the code so that no one else would "optimize" it.

The only optimization that makes sense (personal opinion, and apparently a choice in GCC without any particular optimization or compiler flag) should be replacing "pow(a,2)" with "a*a". That would be the only sane thing a compiler vendor should do.
As Lambdageek pointed out, float multiplication is not associative and you can get less accuracy; but even when reassociation gives better accuracy, you can argue against the optimisation, because you may want a deterministic application. For example, in a game-simulation client/server, where every client has to simulate the same world, you want floating-point calculations to be deterministic.
I would not have expected this case to be optimized at all. It can't be very often that an expression contains subexpressions that can be regrouped to remove entire operations. I would expect compiler writers to invest their time in areas that are more likely to result in noticeable improvements, rather than covering a rarely encountered edge case.

I was surprised to learn from the other answers that this expression can indeed be optimized with the proper compiler switches. Either the optimization is trivial, or it is an edge case of a much more common optimization, or the compiler writers were extremely thorough.

There's nothing wrong with providing hints to the compiler as you've done here. It's a normal and expected part of the micro-optimization process to rearrange statements and expressions to see what differences they will bring.

While the compiler may be justified in considering the two expressions to deliver inconsistent results (without the proper switches), there's no need for you to be bound by that restriction. The difference will be incredibly tiny; so much so that, if the difference matters to you, you should not be using standard floating-point arithmetic in the first place.
There are already a few good answers to this question, but for the sake of completeness I wanted to point out that the applicable section of the C standard is 5.1.2.2.3/15 (which is the same as section 1.9/9 in the C++11 standard). This section states that operators can only be regrouped if they are really associative or commutative.
即可。对于无符号整数,它甚至更容易,因为它们工作于 2 的 mod 幂,因此即使在溢出的情况下也可以自由地重新排序。gcc actually can do this optimization, even for floating-point numbers. For example,
becomes
with
-O -funsafe-math-optimizations
. This reordering violates IEEE-754, though, so it requires the flag.Signed integers, as Peter Cordes pointed out in a comment, can do this optimization without
-funsafe-math-optimizations
since it holds exactly when there is no overflow and if there is overflow you get undefined behavior. So you getwith just
-O
. For unsigned integers, it's even easier since they work mod powers of 2 and so can be reordered freely even in the face of overflow.