Floating point vs integer calculations on modern hardware
I am doing some performance-critical work in C++, and we are currently using integer calculations for problems that are inherently floating point because "it's faster". This causes a whole lot of annoying problems and adds a lot of annoying code.
Now, I remember reading that floating point calculations were very slow around the 386 days, when I believe (IIRC) there was an optional co-processor. But surely nowadays, with exponentially more complex and powerful CPUs, it makes no difference in "speed" whether you do floating point or integer calculation? Especially since the actual calculation time is tiny compared to something like causing a pipeline stall or fetching something from main memory?
I know the correct answer is to benchmark on the target hardware, but what would be a good way to test this? I wrote two tiny C++ programs and compared their run time with "time" on Linux, but the actual run time is too variable (it doesn't help that I am running on a virtual server). Short of spending my entire day running hundreds of benchmarks, making graphs etc., is there something I can do to get a reasonable test of the relative speed? Any ideas or thoughts? Am I completely wrong?
The programs I used are as follows; they are not identical by any means. Program 1:
#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{
    int accum = 0;

    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += rand( ) % 365;
    }
    std::cout << accum << std::endl;

    return 0;
}
Program 2:
#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{
    float accum = 0;

    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += (float)( rand( ) % 365 );
    }
    std::cout << accum << std::endl;

    return 0;
}
Edit: The platform I care about is regular x86 or x86-64 running on desktop Linux and Windows machines.
Edit 2 (pasted from a comment below): We currently have an extensive code base. Really, I have come up against the generalization that we "must not use float since integer calculation is faster", and I am looking for a way (if this is even true) to disprove this generalized assumption. I realize that it would be impossible to predict the exact outcome for us short of doing all the work and profiling it afterwards.
Anyway, thanks for all your excellent answers and help. Feel free to add anything else :).
For example (lower numbers are faster):
64-bit Intel Xeon X5550 @ 2.67GHz, gcc 4.1.2
-O3
32-bit Dual Core AMD Opteron(tm) Processor 265 @ 1.81GHz, gcc 3.4.6
-O3
As Dan pointed out, even once you normalize for clock frequency (which can itself be misleading in pipelined designs), results will vary wildly with CPU architecture: individual ALU/FPU performance matters, as does the actual number of ALUs/FPUs available per core in superscalar designs, which determines how many independent operations can execute in parallel. The latter factor is not exercised by the code below, as all of its operations are sequentially dependent.
Poor man's FPU/ALU operation benchmark:
TIL: this varies (a lot). Here are some results using the GNU compiler (by the way, I also checked by compiling on the machines themselves; GNU g++ 5.4 from xenial is a hell of a lot faster than 4.6.3 from linaro on precise).
Intel i7 4700MQ xenial
Intel i3 2370M has similar results
Intel(R) Celeron(R) 2955U (Acer C720 Chromebook running xenial)
DigitalOcean 1GB Droplet Intel(R) Xeon(R) CPU E5-2630L v2 (running trusty)
AMD Opteron(tm) Processor 4122 (precise)
This uses code from http://pastebin.com/Kx8WGUfg as benchmark-pc.c
I've run multiple passes, and the general numbers seem to stay the same.
One notable exception seems to be ALU mul vs FPU mul. Addition and subtraction seem trivially different.
Here is the above in chart form (lower is faster and preferable):
Update to accommodate @Peter Cordes:
https://gist.github.com/Lewiscowles1986/90191c59c9aedf3d08bf0b129065cccc
i7 4700MQ Linux Ubuntu Xenial 64-bit (all patches to 2018-03-13 applied)
AMD Opteron(tm) Processor 4122 (precise, DreamHost shared-hosting)
Intel Xeon E5-2630L v2 @ 2.4GHz (Trusty 64-bit, DigitalOcean VPS)
Apple Mac Mini M1
Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz
Alas, I can only give you an "it depends" answer...
From my experience, there are many, many variables to performance...especially between integer & floating point math. It varies strongly from processor to processor (even within the same family such as x86) because different processors have different "pipeline" lengths. Also, some operations are generally very simple (such as addition) and have an accelerated route through the processor, and others (such as division) take much, much longer.
The other big variable is where the data reside. If you only have a few values to add, then all of the data can reside in cache, where they can be quickly sent to the CPU. A very, very slow floating point operation that already has the data in cache will be many times faster than an integer operation where an integer needs to be copied from system memory.
I assume that you are asking this question because you are working on a performance critical application. If you are developing for the x86 architecture, and you need extra performance, you might want to look into using the SSE extensions. This can greatly speed up single-precision floating point arithmetic, as the same operation can be performed on multiple data at once, plus there is a separate* bank of registers for the SSE operations. (I noticed in your second example you used "float" instead of "double", making me think you are using single-precision math).
*Note: Using the old MMX instructions would actually slow down programs, because those old instructions actually used the same registers as the FPU does, making it impossible to use both the FPU and MMX at the same time.
There is likely to be a significant difference in real-world speed between fixed-point and floating-point math, but the theoretical best-case throughput of the ALU vs FPU is completely irrelevant. Instead, the number of integer and floating-point registers (real registers, not register names) on your architecture which are not otherwise used by your computation (e.g. for loop control), the number of elements of each type which fit in a cache line, optimizations possible considering the different semantics for integer vs. floating point math -- these effects will dominate. The data dependencies of your algorithm play a significant role here, so that no general comparison will predict the performance gap on your problem.
For example, integer addition is associative, so if the compiler sees a loop like the one you used for a benchmark (assuming the random data was prepared in advance so it wouldn't obscure the results), it can unroll the loop and calculate partial sums with no dependencies, then add them when the loop terminates. But floating-point addition is not associative: reordering the additions can change the rounding, so unless you explicitly allow it (for example with GCC's -ffast-math), the compiler has to do the operations in the order you requested, and each addition strongly depends on the result of the previous one.
You're likely to fit more integer operands in cache at a time as well. So the fixed-point version might outperform the float version by an order of magnitude even on a machine where the FPU has theoretically higher throughput.
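To make the partial-sum idea concrete, here is a minimal sketch of the transformation a compiler is allowed to apply to an integer reduction, written out by hand. The function names (sum_serial, sum_unrolled, float_reassociation_changes_result) are made up for illustration, and the choice of four accumulators is arbitrary, not something from the answer above.

```cpp
#include <cstddef>
#include <vector>

// Straightforward sum: every addition depends on the previous one.
long long sum_serial(const std::vector<int>& v) {
    long long acc = 0;
    for (int x : v) acc += x;
    return acc;
}

// Four independent accumulators: the four dependency chains do not wait
// on each other, so a superscalar CPU can overlap them. For integers the
// result is identical to sum_serial.
long long sum_unrolled(const std::vector<int>& v) {
    long long a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < n; ++i) a0 += v[i];  // leftover elements
    return (a0 + a1) + (a2 + a3);
}

// Shows why the same regrouping is not automatically legal for floats:
// the two groupings of the same three values give different answers.
bool float_reassociation_changes_result() {
    volatile float a = 1e20f, b = -1e20f, c = 1.0f;
    const float left = (a + b) + c;   // 0 + 1 = 1
    const float right = a + (b + c);  // c is absorbed by b, so the sum is 0
    return left != right;
}
```

Whether the unrolled version is actually faster on a given machine still has to be measured; an optimizing compiler may already do this for the integer loop on its own.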
Addition is much faster than rand, so your program is (especially) useless.
You need to identify performance hotspots and incrementally modify your program. It sounds like you have problems with your development environment that will need to be solved first. Is it impossible to run your program on your PC for a small problem set?
Generally, attempting FP jobs with integer arithmetic is a recipe for slow.
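One way to keep rand out of the measurement is to generate the input once, up front, and time only the loop you care about. The following is a rough sketch under that assumption, not a rigorous benchmark harness; the helper names (make_input, sum_int, sum_fp, best_ms) are invented for this example, and double is used instead of the question's float so that the two sums stay exactly comparable.

```cpp
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Build the input once, outside the timed region, so the cost of random
// number generation does not swamp the cost of the additions.
std::vector<int> make_input(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, 364);
    std::vector<int> v(n);
    for (int& x : v) x = dist(gen);
    return v;
}

long long sum_int(const std::vector<int>& v) {
    long long acc = 0;
    for (int x : v) acc += x;
    return acc;
}

double sum_fp(const std::vector<double>& v) {
    double acc = 0.0;
    for (double x : v) acc += x;
    return acc;
}

// Run f() `repeats` times and return the fastest wall-clock time in
// milliseconds; taking the minimum filters out some scheduling noise.
template <typename F>
double best_ms(F f, int repeats) {
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        f();
        const auto t1 = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (best < 0.0 || ms < best) best = ms;
    }
    return best;
}
```

A caller would convert the same input to a std::vector<double>, then time each variant with something like best_ms([&] { sink = sum_int(data); }, 10), where sink is a volatile variable so the compiler cannot discard the work. On a noisy virtual server the minimum over several runs is usually more stable than a single "time" invocation, but it is still only a micro-benchmark.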
Two points to consider -
Modern hardware can overlap instructions, execute them in parallel and reorder them to make best use of the hardware. Also, any significant floating point program is likely to have significant integer work too, even if it's only calculating indices into arrays, loop counters etc., so even if you have a slow floating point instruction it may well be running on a separate bit of hardware, overlapped with some of the integer work. My point is that even if the floating point instructions are slower than the integer ones, your overall program may run faster because it can make use of more of the hardware.
As always, the only way to be sure is to profile your actual program.
The second point is that most CPUs these days have SIMD instructions for floating point that can operate on multiple floating point values at the same time. For example, you can load 4 floats into a single SSE register and then perform 4 multiplications on them in parallel. If you can rewrite parts of your code to use SSE instructions, then it seems likely it will be faster than an integer version. Visual C++ provides compiler intrinsic functions to do this; see http://msdn.microsoft.com/en-us/library/x5c07e2a(v=VS.80).aspx for some information.
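As a small illustration of the "4 floats per register" point, here is a sketch using the SSE intrinsics from xmmintrin.h. The function name multiply and the HAVE_SSE macro are made up for this example; the preprocessor check is an assumption about how to detect SSE on GCC/Clang and MSVC, and the code falls back to a plain scalar loop when SSE is not detected.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1  // hypothetical macro, local to this example
#else
#define HAVE_SSE 0
#endif

// Computes out[i] = a[i] * b[i] for i in [0, n).
void multiply(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);  // load 4 floats (unaligned OK)
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));  // 4 multiplies at once
    }
#endif
    for (; i < n; ++i) out[i] = a[i] * b[i];  // scalar tail, or full fallback
}
```

In practice a modern compiler at -O2 or -O3 will often vectorize the plain scalar loop by itself, so it is worth checking the generated code before reaching for intrinsics.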
Unless you're writing code that will be called millions of times per second (such as, e.g., drawing a line to the screen in a graphics application), integer vs. floating-point arithmetic is rarely the bottleneck.
The usual first step with efficiency questions is to profile your code to see where the run time is really spent. The Linux command for this is gprof.
Edit:
Though I suppose you can always implement the line drawing algorithm using integers and floating-point numbers, call it a large number of times and see if it makes a difference:
http://en.wikipedia.org/wiki/Bresenham's_algorithm
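For anyone who wants to try that experiment, here is a rough sketch of the two variants: an integer-only Bresenham line and a floating-point version that steps along the line with fractional increments (a simple DDA, which I am substituting as the "floating-point" counterpart). The names line_int, line_float and Point are invented for this example, and collecting points into a vector stands in for actually drawing pixels.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;

// Integer-only Bresenham line, valid for all octants.
std::vector<Point> line_int(int x0, int y0, int x1, int y1) {
    std::vector<Point> pts;
    const int dx = std::abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    const int dy = -std::abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;
    for (;;) {
        pts.emplace_back(x0, y0);
        if (x0 == x1 && y0 == y1) break;
        const int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }  // step in x
        if (e2 <= dx) { err += dx; y0 += sy; }  // step in y
    }
    return pts;
}

// Floating-point DDA: advance by a fractional increment per step and
// round to the nearest pixel.
std::vector<Point> line_float(int x0, int y0, int x1, int y1) {
    std::vector<Point> pts;
    const int steps = std::max(std::abs(x1 - x0), std::abs(y1 - y0));
    if (steps == 0) {
        pts.emplace_back(x0, y0);
        return pts;
    }
    const float xinc = static_cast<float>(x1 - x0) / steps;
    const float yinc = static_cast<float>(y1 - y0) / steps;
    float x = static_cast<float>(x0), y = static_cast<float>(y0);
    for (int i = 0; i <= steps; ++i) {
        pts.emplace_back(static_cast<int>(std::lround(x)),
                         static_cast<int>(std::lround(y)));
        x += xinc;
        y += yinc;
    }
    return pts;
}
```

The two functions can pick different pixels where the ideal line passes exactly between two candidates, and the float version accumulates rounding error on very long lines, so compare timings rather than exact output.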
If there were no remainder operation, the floating point version would be much slower. Since all the adds are sequential, the CPU cannot parallelise the summation, so latency is critical. FPU add latency is typically 3 cycles, while integer add latency is 1 cycle. However, the divider used for the remainder operator will probably be the critical part, as it is not fully pipelined on modern CPUs. So, assuming the divide/remainder instruction consumes the bulk of the time, the difference due to add latency will be small.
Today, integer operations are usually a little bit faster than floating point operations. So if you can do a calculation with the same operations in integer and floating point, use integer. HOWEVER you are saying "This causes a whole lot of annoying problems and adds a lot of annoying code". That sounds like you need more operations because you use integer arithmetic instead of floating point. In that case, floating point will run faster because
as soon as you need more integer operations, you probably need a lot more, so the slight speed advantage is more than eaten up by the additional operations
the floating-point code is simpler, which means it is faster to write the code, which means that if it is speed critical, you can spend more time optimising the code.
I ran a test that just added 1 to the number instead of rand(). Results (on an x86-64) were:
Based on that oh-so-reliable "something I've heard": back in the old days, integer calculations were about 20 to 50 times faster than floating point, and these days they are less than twice as fast.