Why does changing 0.1f to 0 slow down performance by 10x?
Why does this bit of code,
const float x[16] = { 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
1.923, 2.034, 2.145, 2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
y[i] = x[i];
}
for (int j = 0; j < 9000000; j++)
{
for (int i = 0; i < 16; i++)
{
y[i] *= x[i];
y[i] /= z[i];
y[i] = y[i] + 0.1f; // <--
y[i] = y[i] - 0.1f; // <--
}
}
run more than 10 times faster than the following bit (identical except where noted)?
const float x[16] = { 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
1.923, 2.034, 2.145, 2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
y[i] = x[i];
}
for (int j = 0; j < 9000000; j++)
{
for (int i = 0; i < 16; i++)
{
y[i] *= x[i];
y[i] /= z[i];
y[i] = y[i] + 0; // <--
y[i] = y[i] - 0; // <--
}
}
when compiling with Visual Studio 2010 SP1.
The optimization level was -O2 with SSE2 enabled.
I haven't tested with other compilers.
Welcome to the world of denormalized floating-point! They can wreak havoc on performance!!!
Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating point representation. Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point. This is because many processors can't handle them directly and must trap and resolve them using microcode.
If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1 is used. Here's the test code compiled on x64:
Output:
Note how in the second run the numbers are very close to zero.
Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.
To demonstrate that this has everything to do with denormalized numbers, if we flush denormals to zero by adding this to the start of the code:
Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled.) This means that rather than using these weird lower-precision almost-zero values, we just round to zero instead.

Timings: Core i7 920 @ 3.5 GHz:
In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops, so that has no effect on performance.
Using gcc and applying a diff to the generated assembly yields only this difference:

The cvtsi2ssq one is indeed 10 times slower.

Apparently, the float version uses an XMM register loaded from memory, while the int version converts a real int value 0 to float using the cvtsi2ssq instruction, which takes a lot of time. Passing -O3 to gcc doesn't help. (gcc version 4.2.1.)

(Using double instead of float doesn't matter, except that it changes the cvtsi2ssq into a cvtsi2sdq.)

Update
Some extra tests show that it is not necessarily the cvtsi2ssq instruction. Once it is eliminated (using int ai = 0; float a = ai; and using a instead of 0), the speed difference remains. So @Mysticial is right: the denormalized floats make the difference. This can be seen by testing values between 0 and 0.1f. The turning point in the above code is approximately at 0.00000000000000000000000000000001, when the loops suddenly take 10 times as long.

Update
A small visualisation of this interesting phenomenon:
You can clearly see the exponent (the last 9 bits) change to its lowest value, when denormalization sets in. At that point, simple addition becomes 20 times slower.
An equivalent discussion about ARM can be found in Stack Overflow question Denormalized floating point in Objective-C?.
It's due to denormalized floating-point use. How to get rid of both it and the performance penalty? Having scoured the Internet for ways of killing denormal numbers, it seems there is no "best" way to do this yet. I have found these three methods that may work best in different environments:
Might not work in some GCC environments:
Might not work in some Visual Studio environments:
Appears to work in both GCC and Visual Studio:
The Intel compiler has options to disable denormals by default on modern Intel CPUs. More details here.

Compiler switches. -ffast-math, -msse or -mfpmath=sse will disable denormals and make a few other things faster, but unfortunately also do lots of other approximations that might break your code. Test carefully! The equivalent of fast-math for the Visual Studio compiler is /fp:fast, but I haven't been able to confirm whether this also disables denormals.
Dan Neely's comment ought to be expanded into an answer:
It is not the zero constant 0.0f that is denormalized or causes a slowdown, it is the values that approach zero on each iteration of the loop. As they come closer and closer to zero, they need more precision to represent, and they become denormalized. These are the y[i] values. (They approach zero because x[i]/z[i] is less than 1.0 for all i.)

The crucial difference between the slow and fast versions of the code is the statement y[i] = y[i] + 0.1f;. As soon as this line is executed on each iteration of the loop, the extra precision in the float is lost, and the denormalization needed to represent that precision is no longer needed. Afterwards, floating-point operations on y[i] remain fast because they aren't denormalized.

Why is the extra precision lost when you add 0.1f? Because floating-point numbers only have so many significant digits. Say you have enough storage for three significant digits; then 0.00001 = 1e-5, and 0.00001 + 0.1 = 0.1, at least for this example float format, because it doesn't have room to store the least significant bit in 0.10001.

In short, y[i] = y[i] + 0.1f; y[i] = y[i] - 0.1f; isn't the no-op you might think it is.

Mysticial said this as well: the content of the floats matters, not just the assembly code.
EDIT: To put a finer point on this, not every floating point operation takes the same amount of time to run, even if the machine opcode is the same. For some operands/inputs, the same instruction will take more time to run. This is especially true for denormal numbers.
In gcc you can enable FTZ and DAZ with this:
Also use the gcc switches -msse -mfpmath=sse.
(credits to Carl Hetherington [1])
[1] http://carlh.net/plugins/denormals.php
For a long time now, CPUs have been only a bit slower on denormal numbers. My Zen 2 CPU needs five clock cycles for a computation with denormal inputs and denormal outputs, versus four clock cycles with normalized numbers.

This is a small benchmark written with Visual C++ to show the slight performance-degrading effect of denormal numbers:
This is the MASM assembly part.
It would be nice to see some results in the comments.
Update for 2023: on a Ryzen 3990X with gcc 10.2 and the compile options -O3 -mavx2 -march=native, the difference between the two versions is:

So it's still slower, but not 10x slower.