Is using double faster than float?

Posted on 2024-09-13 18:20:09

Double values store higher precision and are double the size of a float, but are Intel CPUs optimized for floats?

That is, are double operations just as fast or faster than float operations for +, -, *, and /?

Does the answer change for 64-bit architectures?

Comments (10)

杀手六號 2024-09-20 18:20:09

There isn't a single "Intel CPU", especially in terms of which operations are optimized relative to others! But for most of them, at the CPU level (specifically within the FPU), the answer to your question:

"are double operations just as fast or faster than float operations for +, -, *, and /?"

is "yes" within the CPU, except for division and sqrt, which are somewhat slower for double than for float. (This assumes your compiler uses SSE2 for scalar FP math, as all x86-64 compilers do, and as some 32-bit compilers do depending on options. Legacy x87 doesn't have different widths in registers, only in memory (it converts on load/store), so historically even sqrt and division were just as slow for float as for double.)

For example, Haswell has a divsd throughput of one per 8 to 14 cycles (data-dependent), but a divss (scalar single) throughput of one per 7 cycles. x87 fdiv has a throughput of one per 8 to 18 cycles. (Numbers from https://agner.org/optimize/. Latency correlates with throughput for division, but is higher than the throughput numbers.)

The float versions of many library functions, like logf(float) and sinf(float), will also be faster than log(double) and sin(double), because they have far fewer bits of precision to get right. They can use polynomial approximations with fewer terms to reach full precision for float than for double.

However, taking up twice the memory for each number clearly implies heavier load on the cache(s) and more memory bandwidth to fill and spill those cache lines from/to RAM; the time you care about performance of a floating-point operation is when you're doing a lot of such operations, so the memory and cache considerations are crucial.
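
To put rough numbers on the footprint argument, here is a minimal sketch. It assumes a 64-byte cache line (typical for current x86 CPUs, but an assumption here) and the usual 4-byte float and 8-byte double; the helper names are made up for illustration.

```cpp
#include <cstddef>

// Assumed cache-line size; 64 bytes is typical for current x86 CPUs.
constexpr std::size_t kCacheLineBytes = 64;

// How many elements of type T fit in one (assumed 64-byte) cache line.
template <typename T>
constexpr std::size_t elements_per_cache_line() {
    return kCacheLineBytes / sizeof(T);
}

// Bytes needed to store n elements of type T contiguously.
template <typename T>
constexpr std::size_t array_bytes(std::size_t n) {
    return n * sizeof(T);
}
```

Under those assumptions, elements_per_cache_line<float>() is 16 and elements_per_cache_line<double>() is 8, so streaming over an array of doubles touches twice as many cache lines as the same number of floats.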

@Richard's answer points out that there are also other ways to perform FP operations (the SSE / SSE2 instructions; good old MMX was integer-only), especially suitable for simple operations on lots of data ("SIMD", single instruction / multiple data), where each vector register can pack 4 single-precision floats or only 2 double-precision ones, so this effect will be even more marked.

In the end, you do have to benchmark, but my prediction is that for reasonable (i.e., large) benchmarks you'll find an advantage in sticking with single precision, assuming of course that you don't need the extra bits of precision.

独行侠 2024-09-20 18:20:09

If all floating-point calculations are performed within the FPU, then, no, there is no difference between a double calculation and a float calculation because the floating point operations are actually performed with 80 bits of precision in the FPU stack. Entries of the FPU stack are rounded as appropriate to convert the 80-bit floating point format to the double or float floating-point format. Moving sizeof(double) bytes to/from RAM versus sizeof(float) bytes is the only difference in speed.

If, however, you have a vectorizable computation, then you can use the SSE extensions to run four float calculations in the same time as two double calculations. Therefore, clever use of the SSE instructions and the XMM registers can allow higher throughput on calculations that only use floats.
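
As an illustration of the "four floats versus two doubles" point, here is a small sketch using SSE/SSE2 intrinsics. It assumes an x86-64 target (where SSE2 is baseline); the function names add4_floats and add2_doubles are made up for this example.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE ones)

// Adds four floats at once: one 128-bit XMM register holds 4 x 32-bit lanes.
inline void add4_floats(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);   // unaligned load of 4 floats
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

// Adds two doubles at once: the same register holds only 2 x 64-bit lanes.
inline void add2_doubles(const double* a, const double* b, double* out) {
    __m128d va = _mm_loadu_pd(a);  // unaligned load of 2 doubles
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}
```

Each function performs one vector add, but the float version processes twice as many elements per call. Whether that turns into a real speedup still depends on the surrounding code and on memory bandwidth.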

榆西 2024-09-20 18:20:09

Another point to consider is whether you are using a GPU (the graphics card). I work on a project that is numerically intensive, yet we do not need the precision that double offers. We use GPU cards to help further speed up the processing. CUDA GPUs need a special package to support double, and the local RAM on a GPU is quite fast but quite scarce. As a result, using float also doubles the amount of data we can store on the GPU.

Yet another point is memory. Floats take half as much RAM as doubles. If you are dealing with VERY large datasets, this can be a really important factor. If using double means you have to cache to disk instead of staying purely in RAM, the difference will be huge.

So for the application I am working with, the difference is quite important.

不美如何 2024-09-20 18:20:09

I just want to add to the already existing great answers that the __m256? family of single-instruction-multiple-data (SIMD) C++ intrinsic functions operates on either 4 doubles in parallel (e.g. _mm256_add_pd) or 8 floats in parallel (e.g. _mm256_add_ps).

I'm not sure if this can translate to an actual speed up, but it seems possible to process 2x as many floats per instruction when SIMD is used.

沐歌 2024-09-20 18:20:09

In an experiment that adds 3.3 to an accumulator 2,000,000,000 times, the results are:

Summation time in s: 2.82 summed value: 6.71089e+07 // float
Summation time in s: 2.78585 summed value: 6.6e+09 // double
Summation time in s: 2.76812 summed value: 6.6e+09 // long double

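So double is marginally faster here (by roughly 1%), and it is the default floating-point type in C and C++. It's more portable and is the default across the C and C++ library functions. Also, double has significantly higher precision than float, as the float sum above shows: it stops at 6.71089e+07 instead of reaching 6.6e+09.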

Even Stroustrup recommends double over float:

"The exact meaning of single-, double-, and extended-precision is implementation-defined. Choosing the right precision for a problem where the choice matters requires significant understanding of floating-point computation. If you don't have that understanding, get advice, take the time to learn, or use double and hope for the best."

Perhaps the only case where you should use float instead of double is on 64-bit hardware with a modern gcc, because float is smaller: double is 8 bytes and float is 4 bytes.
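
The float result of 6.71089e+07 in the output above is worth a closer look. The following sketch, which assumes IEEE 754 single precision with round-to-nearest and float arithmetic evaluated in float (as on x86-64 with SSE), finds the value at which adding a fixed step no longer changes a float accumulator. The function name is made up for illustration.

```cpp
// Repeatedly adds `step` to a float accumulator and returns the first value
// at which the addition no longer changes the accumulator.
// Only call this with a positive step, otherwise the loop may not terminate.
inline float float_sum_stall_point(float step) {
    float sum = 0.0f;
    for (;;) {
        float next = sum + step;   // rounded to float precision
        if (next == sum) {
            return sum;            // step is now below half the spacing of floats
        }
        sum = next;
    }
}
```

For a step of 3.3f the loop stops at 67108864 (2^26): from there on, neighbouring floats are 8 apart, so adding 3.3 rounds back to the same value. That matches the 6.71089e+07 printed for float, and it means the float timing above measures a sum that is no longer doing useful work.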

望笑 2024-09-20 18:20:09

The only really useful answer is: only you can tell. You need to benchmark your scenarios. Small changes in instruction and memory patterns could have a significant impact.

It will certainly matter whether you are using the FPU or SSE-type hardware (the former does all its work with 80-bit extended precision, so double will be closer; the latter is natively 32-bit, i.e. float).

Update: s/MMX/SSE/ as noted in another answer.
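
As a starting point for such a benchmark, here is a minimal timing sketch using only the standard library. The function name time_sum is made up for illustration, and a real measurement should use many repetitions, realistic data sizes, and ideally a dedicated benchmarking library.

```cpp
#include <chrono>
#include <vector>

// Sums a vector of T and returns the elapsed time in seconds.
// The sum is written to `result` so the compiler cannot discard the loop.
template <typename T>
double time_sum(const std::vector<T>& data, T& result) {
    const auto start = std::chrono::steady_clock::now();
    T acc = T(0);
    for (const T& x : data) {
        acc += x;
    }
    const auto stop = std::chrono::steady_clock::now();
    result = acc;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_sum with a std::vector<float> and a std::vector<double> of the same length gives a first, rough comparison for your own machine and compiler settings.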

卖梦商人 2024-09-20 18:20:09

Alex Martelli's answer is good enough, but I want to mention a wrong but somewhat popular test method that may have misled some people:

#include <cstdio>
#include <ctime>
int main() {
  const auto start_clock = clock();
  float a = 0;
  for (int i = 0; i < 256000000; i++) {
    // bad latency benchmark that includes as much division as other operations
    a += 0.11;  // note the implicit conversions of a to double to match 0.11
    a -= 0.13;  // rather than 0.11f
    a *= 0.17;
    a /= 0.19;
  }
  printf("c++ float duration = %.3f\n", 
    (double)(clock() - start_clock) / CLOCKS_PER_SEC);
  printf("%.3f\n", a);
  return 0;
}

This is wrong! In C++, floating-point literals are double by default. If you replace += 0.11 with += 0.11f (and likewise for the other constants), float will usually be faster than double on an x86 CPU.

By the way, with the modern SSE instruction set, float and double have the same speed in the CPU core itself, except for division. float being smaller may lead to fewer cache misses if you have arrays of them.

And if the compiler can auto-vectorize, float vectors work on twice as many elements per instruction as double.
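
The implicit conversion in that benchmark can be checked at compile time. This is a small sketch; the helper names step_mixed and step_float are made up for illustration.

```cpp
#include <type_traits>

// 0.11 is a double literal, so float + 0.11 is computed in double.
static_assert(std::is_same<decltype(1.0f + 0.11), double>::value,
              "float + double literal yields double");
// 0.11f is a float literal, so the arithmetic stays in float.
static_assert(std::is_same<decltype(1.0f + 0.11f), float>::value,
              "float + float literal yields float");

// Converts a to double, adds in double, then converts the result back to float.
inline float step_mixed(float a) {
    a += 0.11;
    return a;
}

// Stays in single precision throughout.
inline float step_float(float a) {
    a += 0.11f;
    return a;
}
```

In step_mixed the compiler has to emit float-to-double and double-to-float conversions around every operation, which is what makes the benchmark above measure conversions rather than pure float arithmetic.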

何时共饮酒 2024-09-20 18:20:09

Previous answers miss a factor that may cause a big difference (> 4x) between float and double: denormals.
See: Avoiding denormal values in C++
Since double has a much wider normal range, a problem that involves many small values is much more likely to fall into the denormal range with float than with double, so float could be much slower than double in this case.
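
To make the range difference concrete, here is a small sketch that classifies a value as subnormal (denormal) in its own type, assuming IEEE 754 float and double. The helper name is_subnormal is made up for illustration.

```cpp
#include <cmath>

// True if x is a nonzero subnormal (denormal) value of its own type.
template <typename T>
bool is_subnormal(T x) {
    return std::fpclassify(x) == FP_SUBNORMAL;
}
```

With IEEE 754 types, is_subnormal(1e-40f) is true because the smallest normal float is about 1.18e-38, while is_subnormal(1e-40) is false because the smallest normal double is about 2.2e-308. A computation whose intermediate values decay toward 1e-40 would therefore hit the slow denormal path with float but not with double.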

享受孤独 2024-09-20 18:20:09

Floating point is normally an extension to one's general purpose CPU. The speed will therefore be dependent on the hardware platform used. If the platform has floating point support, I will be surprised if there is any difference.

美男兮 2024-09-20 18:20:09

In addition, here is some real benchmark data to give a glimpse:

For Intel 3770k, GCC 9.3.0 -O2 [3]
Run on (8 X 3503 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_FloatCreation               0.281 ns        0.281 ns   1000000000
BM_DoubleCreation              0.284 ns        0.281 ns   1000000000
BM_Vector3FCopy                0.558 ns        0.562 ns   1000000000
BM_Vector3DCopy                 5.61 ns         5.62 ns    100000000
BM_Vector3F_CopyDefault        0.560 ns        0.546 ns   1000000000
BM_Vector3D_CopyDefault         5.57 ns         5.56 ns    112178768
BM_Vector3F_Copy123            0.841 ns        0.817 ns    897430145
BM_Vector3D_Copy123             5.59 ns         5.42 ns    112178768
BM_Vector3F_Add                0.841 ns        0.834 ns    897430145
BM_Vector3D_Add                 5.59 ns         5.46 ns    100000000
BM_Vector3F_Mul                0.842 ns        0.782 ns    897430145
BM_Vector3D_Mul                 5.60 ns         5.56 ns    112178768
BM_Vector3F_Compare            0.840 ns        0.800 ns    897430145
BM_Vector3D_Compare             5.61 ns         5.62 ns    100000000
BM_Vector3F_ARRAY_ADD           3.25 ns         3.29 ns    213673844        
BM_Vector3D_ARRAY_ADD           3.13 ns         3.06 ns    224357536        

Operations on 3 floats (F) or 3 doubles (D) are compared, where:
- BM_Vector3XCopy is a pure copy of a vector initialized to (1,2,3); the initialization is not repeated before each copy,
- BM_Vector3X_CopyDefault repeats the default initialization on every copy,
- BM_Vector3X_Copy123 repeats the initialization to (1,2,3) on every copy,
- Add/Mul each initialize 3 vectors to (1,2,3) and add/multiply the first and second into the third,
- Compare checks two initialized vectors for equality,
- ARRAY_ADD sums vector(1,2,3) + vector(3,4,5) + vector(6,7,8) via std::valarray, which in my case leads to SSE instructions.

Remember that these are isolated tests, and the results differ with compiler settings, from machine to machine, and from architecture to architecture.
With caching (issues) and real-world use cases this may be completely different, so theory can differ greatly from reality.
The only way to find out is a practical test, for example with google-benchmark [1], and checking the compiler output for your particular solution [2].

  1. https://github.com/google/benchmark
  2. https://sourceware.org/binutils/docs/binutils/objdump.html -> objdump -S
  3. https://github.com/Jedzia/oglTemplate/blob/dd812b72d846ae888238d6f726d503485b796b68/benchmark/Playground/BM_FloatingPoint.cpp