What is the relative speed of floating-point addition vs. floating-point multiplication?

Posted on 2024-07-29 05:45:37

A decade or two ago, it was worthwhile to write numerical code to avoid using multiplies and divides and use addition and subtraction instead. A good example is using forward differences to evaluate a polynomial curve instead of computing the polynomial directly.

Is this still the case, or have modern computer architectures advanced to the point where *,/ are no longer many times slower than +,- ?

To be specific, I'm interested in compiled C/C++ code running on modern typical x86 chips with extensive on-board floating point hardware, not a small micro trying to do FP in software. I realize pipelining and other architectural enhancements preclude specific cycle counts, but I'd still like to get a useful intuition.
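
For readers unfamiliar with the forward-difference trick mentioned above, here is a minimal sketch in C, assuming a cubic sampled at evenly spaced points. The function names (`cubic_horner`, `eval_direct`, `eval_forward_diff`) are made up for this illustration and are not from any library.

```c
#include <assert.h>
#include <stddef.h>

/* Direct evaluation with Horner's rule: 3 multiplies and 3 adds per point. */
static double cubic_horner(const double c[4], double t)
{
    return ((c[3] * t + c[2]) * t + c[1]) * t + c[0];
}

/* Fill out[0..n-1] with p(t0 + k*h), evaluating the polynomial at each point. */
void eval_direct(const double c[4], double t0, double h, double *out, size_t n)
{
    assert(out != NULL);
    for (size_t k = 0; k < n; k++)
        out[k] = cubic_horner(c, t0 + (double)k * h);
}

/* Same samples via forward differences: after the setup, each point costs
 * three additions and no multiplications. */
void eval_forward_diff(const double c[4], double t0, double h, double *out, size_t n)
{
    assert(out != NULL);

    /* Setup: four direct evaluations give the initial difference table. */
    double p0 = cubic_horner(c, t0);
    double p1 = cubic_horner(c, t0 + h);
    double p2 = cubic_horner(c, t0 + 2.0 * h);
    double p3 = cubic_horner(c, t0 + 3.0 * h);

    double f  = p0;                                /* current value      */
    double d1 = p1 - p0;                           /* first difference   */
    double d2 = p2 - 2.0 * p1 + p0;                /* second difference  */
    double d3 = p3 - 3.0 * p2 + 3.0 * p1 - p0;     /* third, constant    */

    for (size_t k = 0; k < n; k++) {
        out[k] = f;
        f  += d1;
        d1 += d2;
        d2 += d3;
    }
}
```

Note that the forward-difference loop carries its rounding error from one point to the next, so over long runs it drifts away from the direct evaluation. That accuracy cost is part of the trade-off the question asks about.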

Comments (6)

ゞ花落谁相伴 2024-08-05 05:45:37

It also depends on instruction mix. Your processor will have several computation units standing by at any time, and you'll get maximum throughput if all of them are filled all the time. So, executing a loop of muls is just as fast as executing a loop of adds, but the same doesn't hold if the expression becomes more complex.

For example, take this loop:

for(int j=0;j<NUMITER;j++) {
  for(int i=1;i<NUMEL;i++) {
    bla += 2.1 + arr1[i] + arr2[i] + arr3[i] + arr4[i] ;
  }
}

For NUMITER=10^7 and NUMEL=10^2, with all arrays initialized to small positive numbers (NaN is much slower), this takes 6.0 seconds using doubles on a 64-bit proc. If I replace the loop body with

bla += 2.1 * arr1[i] + arr2[i] + arr3[i] * arr4[i] ;

it only takes 1.7 seconds... so since we "overdid" the additions, the muls were essentially free, and the reduction in additions helped. It gets more confusing:

bla += 2.1 + arr1[i] * arr2[i] + arr3[i] * arr4[i] ;

-- same mul/add distribution, but now the constant is added in rather than multiplied in -- takes 3.7 seconds. Your processor is likely optimized to perform typical numerical computations more efficiently, so dot-product-like sums of muls and scaled sums are about as good as it gets; adding constants isn't nearly as common, so that's slower...

bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; /*someval == 2.1*/

again takes 1.7 seconds.

bla += someval + arr1[i] + arr2[i] + arr3[i] + arr4[i] ; /*someval == 2.1*/

(same as initial loop, but without expensive constant addition: 2.1 seconds)

bla += someval * arr1[i] * arr2[i] * arr3[i] * arr4[i] ; /*someval == 2.1*/

(mostly muls, but one addition: 1.9 seconds)

So, basically, it's hard to say which is faster. If you wish to avoid bottlenecks, it matters more to have a sane mix of operations, to avoid NaN or INF, and to avoid adding constants. Whatever you do, make sure you test, and test with various compiler settings, since small changes can often make the difference.
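
The point about keeping several computation units busy can be illustrated with a small sketch. It is not the code behind the timings in this answer; `dot_single` and `dot_four` are hypothetical names, and whether the second version is actually faster depends on your compiler, flags and CPU, so treat it as something to measure rather than a guaranteed win.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every addition has to wait for the previous one,
 * so the loop is limited by the latency of a single add chain. */
double dot_single(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Four independent accumulators: the four add chains do not depend on
 * each other, which gives the hardware the chance to overlap them. */
double dot_four(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* leftover elements when n % 4 != 0 */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```

Because floating-point addition is not associative, the two functions can differ in the last few bits on general input. That is also why a compiler will usually not make this transformation for you unless you allow it to reorder floating-point operations.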

Some more cases:

bla *= someval; // someval very near 1.0; takes 2.1 seconds
bla *= arr1[i] ;// arr1[i] all very near 1.0; takes 66(!) seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; // 1.6 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; //32-bit mode, 2.2 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; //32-bit mode, floats 2.2 seconds
bla += someval * arr1[i]* arr2[i];// 0.9 in x64, 1.6 in x86
bla += someval * arr1[i];// 0.55 in x64, 0.8 in x86
bla += arr1[i] * arr2[i];// 0.8 in x64, 0.8 in x86, 0.95 in CLR+x64, 0.8 in CLR+x86

森林迷了鹿 2024-08-05 05:45:37

In theory the information is here:

Intel®64 and IA-32 Architectures Optimization Reference Manual, APPENDIX C INSTRUCTION LATENCY AND THROUGHPUT

For every processor they list, the latency on FMUL is very close to that of FADD or FDIV. On some of the older processors, FDIV is 2-3 times slower than that, while on newer processors, it's the same as FMUL.

Caveats:

  1. The document I linked actually says you can't rely on these numbers in real life, since the processor will do whatever it wishes to make things faster as long as the result is correct.

  2. There's a good chance your compiler will decide to use one of the many newer instruction sets that have a floating-point multiply / divide available.

  3. This is a complicated document only meant to be read by compiler writers, and I might have gotten it wrong. For example, I'm not clear why the FDIV latency number is completely missing for some of the CPUs.

反话 2024-08-05 05:45:37

The best way to answer this question is to actually write a benchmark/profile of the processing you need to do. Empirical results should be preferred over theoretical ones whenever possible, especially when they are easy to obtain.

If you already know different implementations of the math you need to do, you could write a few different code transformations of the math and see where your performance peaks. This will allow the processor/compiler to generate different execution streams to fill the processor pipelines and give you a concrete answer to your question.

If you are interested specifically in the performance of DIV/MUL/ADD/SUB type instructions, you could even toss in some inline assembly to control exactly which variants of these instructions are executed. However, you need to make sure you're keeping multiple execution units busy to get a good idea of the performance the system is capable of.

Also doing something like this would allow you to compare performance on multiple variations of the processor by simply running the same program on them, and could also allow you to factor in the motherboard differences.

Edit:

The basic architecture of + and - is identical, so they logically take the same time to compute. * on the other hand requires multiple layers, typically constructed out of "full adders", to complete a single operation. This guarantees that while a * can be issued to the pipeline every cycle, it will have a higher latency than an add/subtract circuit. An FP / operation is typically implemented using an approximation method which iteratively converges towards the correct answer. These types of approximations are typically implemented via multiplication. So for floating point you can generally assume division will take longer, because it's impractical to "unroll" the multiplications (each of which is already a large circuit in and of itself) into a pipeline of many multiplier circuits. Still, the performance of a given system is best measured via testing.
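
To make "an approximation method implemented via multiplication" concrete, here is a software sketch of division by Newton-Raphson reciprocal iteration. It only illustrates the idea described above and is not a description of how any particular x86 divider works; real hardware may use different algorithms. `divide_newton` is a hypothetical name, and the sketch assumes a finite, positive divisor.

```c
#include <assert.h>
#include <math.h>

/* Approximate n / d using only multiplies and subtracts inside the loop.
 * Assumes d is finite and greater than zero. */
double divide_newton(double n, double d, int iterations)
{
    assert(d > 0.0 && isfinite(d));

    int e;
    double m = frexp(d, &e);    /* d = m * 2^e with m in [0.5, 1) */

    /* Linear first guess for 1/m: 48/17 - (32/17) * m, written as decimal
     * constants. Its relative error on [0.5, 1) is at most 1/17. */
    double x = 2.8235294117647058 - 1.8823529411764706 * m;

    /* Newton step for f(x) = 1/x - m. Each step roughly squares the error,
     * so the number of correct bits about doubles per iteration. */
    for (int i = 0; i < iterations; i++)
        x = x * (2.0 - m * x);

    /* n / d = n * (1/m) * 2^(-e) */
    return ldexp(n * x, -e);
}
```

With this starting guess, four iterations bring the result to within a few units in the last place of a double. Each iteration is two dependent multiplies and a subtract, which gives some intuition for why division has a longer latency than a single multiply.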

倒数 2024-08-05 05:45:37

I can't find a definitive reference, but extensive experimentation tells me that float multiplication nowadays is just about the same speed as addition and subtraction, while division isn't (but not "many times" slower, either). You can get the intuition you desire only by running your own experiments -- remember to generate the random numbers (millions of them) in advance, read them before you start timing, and use the CPU's performance counters (with as few other processes running as you can manage) for accurate measurement!
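
A minimal sketch of such an experiment in C follows. It is an assumption-laden stand-in rather than a rigorous benchmark: it uses the standard `clock()` function instead of CPU performance counters, and the names `fill_random`, `time_op` and `enum op` are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

enum op { OP_ADD, OP_MUL, OP_DIV };

/* Generate the inputs in advance: pseudo-random doubles between 1 and 2,
 * from a simple 64-bit linear congruential generator. Writing the buffer
 * also touches its memory before any timing starts. */
void fill_random(double *buf, size_t n, unsigned long long seed)
{
    assert(buf != NULL);
    unsigned long long s = seed;
    for (size_t i = 0; i < n; i++) {
        s = s * 6364136223846793005ULL + 1442695040888963407ULL;
        buf[i] = 1.0 + (double)(s >> 11) / 9007199254740992.0;  /* 2^53 */
    }
}

/* Apply one operation element-wise and return the elapsed CPU time in
 * seconds. The results go to out[] so the work cannot be optimized away. */
double time_op(enum op which, const double *a, const double *b,
               double *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    clock_t start = clock();
    switch (which) {
    case OP_ADD:
        for (size_t i = 0; i < n; i++) out[i] = a[i] + b[i];
        break;
    case OP_MUL:
        for (size_t i = 0; i < n; i++) out[i] = a[i] * b[i];
        break;
    case OP_DIV:
        for (size_t i = 0; i < n; i++) out[i] = a[i] / b[i];
        break;
    }
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

In use, you would fill two arrays of a few million elements, call `time_op` several times for each operation, and compare the best timings. With arrays that large, memory bandwidth can dominate and hide the differences between operations, so it is worth repeating the experiment with arrays small enough to stay in cache.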

堇年纸鸢 2024-08-05 05:45:37

The speed difference of * / vs + - depends on your processor architecture. In general, and with x86 in particular, the speed difference has become smaller with modern processors. * should be close to +; when in doubt, just experiment. If you have a really hard problem with lots of FP operations, also consider using your GPU (GeForce, ...), which works as a vector processor.

小傻瓜 2024-08-05 05:45:37

There is probably very little difference in time between multiplication and addition. Division, on the other hand, is still significantly slower than multiplication because of its recursive nature.
On modern x86 architectures, SSE instructions should be considered when doing floating-point operations rather than using the FPU, though a good C/C++ compiler should give you the option of using SSE instead of the FPU.
