c++的实际计算复杂度平方根()

发布于 2024-11-27 06:26:47 字数 225 浏览 5 评论 0原文

编辑之间的 CPU 周期(或者本质上是“速度”)有什么区别

 x /= y;

 #include <cmath>
 x = sqrt(y);

:我知道这些操作并不等效,我只是任意建议 x /= y 作为x = sqrt(y) 的基准

What is the difference in CPU cycles (or, in essence, in 'speed') between

 x /= y;

and

 #include <cmath>
 x = sqrt(y);

EDIT: I know the operations aren't equivalent, I'm just arbitrarily proposing x /= y as a benchmark for x = sqrt(y)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

讽刺将军 2024-12-04 06:26:47

您问题的答案取决于您的目标平台。假设您使用的是最常见的 x86 cpu,我可以给您这个链接 http://instlatx64.atw.hu/ 这是测量指令延迟的集合(CPU 在有参数后需要多长时间才能获得结果)以及它们如何针对许多 x86 和 x86_64 处理器进行流水线处理。如果您的目标不是 x86,您可以尝试自行测量成本或查阅 CPU 文档。

首先,您应该获得操作的反汇编程序(从编译器,例如 gcc:gcc file.c -O3 -S -o file.asm 或通过编译的二进制文件的反汇编,例如在调试器的帮助下)。
请记住,在您的操作中存在加载和存储一个值,必须另外计算该值。

以下是来自 friweb.hu 的两个示例:

对于 Core 2 Duo E6700,SQRT 延迟 (L)(x87、SSE 和 SSE2 版本)

  • 32 位浮点为 29 个刻度; 64 位双精度数为 58 个刻度; 80 位长双精度数为 69 个刻度;

DIVIDE(浮点数):

  • 32 位为 18 个刻度; 64 位为 32 个刻度; 80 位为 38 个刻度

对于较新的处理器,成本较低,并且 DIV 和 SQRT 几乎相同,例如对于 Sandy Bridge Intel CPU:

32位浮点 SQRT

  • 为 14 个刻度; 64 位为 21 个刻度; 为 24 个刻度,

80位浮点除法

  • 32 位浮点除法为 14 个刻度; 64 位为 22 个刻度; 80 位

SQRT 为 24 个刻度,甚至 32 位更快一个刻度。

因此:对于较旧的 CPU,sqrt 本身比 fdiv 慢 30-50%;对于较新的 CPU,成本是相同的。
对于较新的 CPU,这两种操作的成本都比旧 CPU 低;
对于较长的浮动格式,您需要更多时间;例如,对于 64 位,您需要的时间是 32 位的 2 倍;但80位比64位便宜。

此外,较新的 CPU 具有与标量 (x87) 相同速度的向量运算(SSE、SSE2、AVX)。向量由 2-4 个相同类型的数据组成。如果您可以调整循环以使用相同的操作处理多个 FP 值,您将从 CPU 获得更高的性能。

The answer to your question depends on your target platform. Assuming you are using most common x86 cpus, I can give you this link http://instlatx64.atw.hu/ This is a collection of measured instruction latency (How long will it take to CPU to get result after it has argument) and how they are pipelined for many x86 and x86_64 processors. If your target is not x86, you can try to measure cost yourself or consult with your CPU documentation.

Firstly you should get a disassembler of your operations (from compiler e.g. gcc: gcc file.c -O3 -S -o file.asm or via dissasembly of compiled binary, e.g. with help of debugger).
Remember, that In your operation there is loading and storing a value, which must be counted additionally.

Here are two examples from friweb.hu:

For Core 2 Duo E6700 latency (L) of SQRT (both x87, SSE and SSE2 versions)

  • 29 ticks for 32-bit float; 58 ticks for 64-bit double; 69 ticks for 80-bit long double;

of DIVIDE (of floating point numbers):

  • 18 ticks for 32-bit; 32 ticks for 64-bit; 38 ticks for 80-bit

For newer processors, the cost is less and is almost the same for DIV and for SQRT, e.g. for Sandy Bridge Intel CPU:

Floating-point SQRT is

  • 14 ticks for 32 bit; 21 ticks for 64 bit; 24 ticks for 80 bit

Floating-point DIVIDE is

  • 14 ticks for 32 bit; 22 ticks for 64 bit; 24 ticks for 80 bit

SQRT even a tick faster for 32bit.

So: For older CPUs, sqrt is itself 30-50 % slower than fdiv; For newer CPU the cost is the same.
For newer CPU, cost of both operations become lower that it was for older CPUs;
For longer floating format you needs more time; e.g. for 64-bit you need 2x time than for 32bit; but 80-bit is cheapy compared with 64-bit.

Also, newer CPUs have vector operations (SSE, SSE2, AVX) of the same speed as scalar (x87). Vectors are of 2-4 same-typed data. If you can align your loop to work on several FP values with same operation, you will get more performance from CPU.

樱娆 2024-12-04 06:26:47

如果平方根函数没有在特殊的硬件或软件中实现,大多数库函数将使用牛顿法来计算它,该方法二次收敛。

牛顿法是一种迭代方法:您进行初始猜测,计算试验结果,并将其用于下一次猜测。不断重复,直到您认为得到的结果“足够接近”。碰巧你可以用平方根证明你需要多少次迭代。每次经过该周期,您都会获得另外两位数的精度,因此大多数实现将在 8-9 个周期内收敛到双精度数的精度限制。

如果你仔细阅读这篇,你会发现迭代牛顿法做了两个减法每次迭代一次乘法和一次除法。

If the square root function isn't implemented in special hardware or software, most library functions would calculate it using Newton's method, which converges quadratically.

Newton's method is an iterative method: you make an initial guess, calculate a trial result, and use that for the next guess. You repeat until you think you have a result that's "close enough." It so happens that you can prove how many iterations you need with square root. Every time through the cycle you get another two digits of accuracy, so most implementations will converge to the precision limit of doubles in 8-9 cycles.

If you read this carefully, you'll see that the iterative Newton's method is doing two subtractions, one multiplication, and one division per iteration.

感悟人生的甜 2024-12-04 06:26:47

一般经验法则:浮点除法和平方根都被认为是慢速运算(与加法或乘法等快速运算相比)。与除法相比,平方根预计具有大致相同的速度或稍慢(即性能降低约 1 倍 - 2 倍)。例如 Pentium Pro

除法和平方根的延迟分别为 18 至 36 和 29 至 69 个周期

为了获得更准确的答案,您需要深入研究平台的架构手册或执行基准测试。

注意:许多现代平台还提供反平方根,其速度与 sqrt 大致相同,但通常更有用(例如,通过使用 invsqrt,您可以计算 sqrt 和 div,每个乘法一次)。

As a general rule of thumb: Both floating point division and square root are considered as slow operation (compared to fast ones like addition or multiplication). Square root can be expect to be approximately the same speed or somewhat slower (i.e. approx. 1x - 2x lower performance) compared to a division. E.g. on Pentium Pro

Division and square root have a latency of 18 to 36 and 29 to 69 cycles, respectively

To get more accurate answer, you need to dig into architecture manual for your platform or perform a benchmark.

Note: many modern platforms also offer inverse square root, which has the speed approximately the same as sqrt, but is often more useful (e.g. by having invsqrt you can compute both sqrt and div with one multiplication for each).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文