Floating point vs integer calculations on modern hardware

I am doing some performance-critical work in C++, and we are currently using integer calculations for problems that are inherently floating point because "it's faster". This causes a whole lot of annoying problems and adds a lot of annoying code.

Now, I remember reading about how floating point calculations were so slow back around the 386 days, when I believe (IIRC) there was an optional co-processor. But surely nowadays, with exponentially more complex and powerful CPUs, it makes no difference in "speed" whether you do floating point or integer calculations? Especially since the actual calculation time is tiny compared to something like causing a pipeline stall or fetching something from main memory?

I know the correct answer is to benchmark on the target hardware; what would be a good way to test this? I wrote two tiny C++ programs and compared their run time with "time" on Linux, but the actual run time is too variable (it doesn't help that I am running on a virtual server). Short of spending my entire day running hundreds of benchmarks, making graphs, etc., is there something I can do to get a reasonable test of the relative speed? Any ideas or thoughts? Am I completely wrong? One idea I had is to time just the hot loop in-process; I've sketched that after the two programs below.

The programs I used are as follows, and they are not identical by any means:

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{
    int accum = 0;

    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += rand( ) % 365;
    }
    std::cout << accum << std::endl;

    return 0;
}

Program 2:

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{

    float accum = 0;
    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += (float)( rand( ) % 365 );
    }
    std::cout << accum << std::endl;

    return 0;
}
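
The in-process timing idea, roughly: time only the loop with std::chrono and take the best of several repetitions, so process startup and scheduler noise drop out. A rough sketch of what I mean (the best-of-5 count and the fixed seed are arbitrary choices):

#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>

// Time one run of the float accumulation loop in-process.
static double time_once(unsigned iters)
{
    float accum = 0;
    auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < iters; ++i)
        accum += (float)(rand() % 365);
    auto stop = std::chrono::steady_clock::now();
    volatile float sink = accum; // keep the loop from being optimized away
    (void)sink;
    return std::chrono::duration<double>(stop - start).count();
}

int main()
{
    srand(42); // fixed seed so runs are comparable
    double best = 1e30;
    for (int rep = 0; rep < 5; ++rep) // best-of-5 damps virtual-server noise
        best = std::min(best, time_once(100000000u));
    std::cout << "best: " << best << " s" << std::endl;
    return 0;
}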

Edit: The platform I care about is regular x86 or x86-64 running on desktop Linux and Windows machines.

Edit 2 (pasted from a comment below): We have an extensive code base currently. Really I have come up against the generalization that we "must not use float since integer calculation is faster" - and I am looking for a way (if this is even true) to disprove this generalized assumption. I realize that it would be impossible to predict the exact outcome for us short of doing all the work and profiling it afterwards.

Anyway, thanks for all your excellent answers and help. Feel free to add anything else :).

Comments (11)

女中豪杰 2024-09-03 08:34:59

For example (lesser numbers are faster; the 0.000000 timings below suggest the compiler optimized those particular loops away entirely),

64-bit Intel Xeon X5550 @ 2.67GHz, gcc 4.1.2 -O3

short add/sub: 1.005460 [0]
short mul/div: 3.926543 [0]
long add/sub: 0.000000 [0]
long mul/div: 7.378581 [0]
long long add/sub: 0.000000 [0]
long long mul/div: 7.378593 [0]
float add/sub: 0.993583 [0]
float mul/div: 1.821565 [0]
double add/sub: 0.993884 [0]
double mul/div: 1.988664 [0]

32-bit Dual Core AMD Opteron(tm) Processor 265 @ 1.81GHz, gcc 3.4.6 -O3

short add/sub: 0.553863 [0]
short mul/div: 12.509163 [0]
long add/sub: 0.556912 [0]
long mul/div: 12.748019 [0]
long long add/sub: 5.298999 [0]
long long mul/div: 20.461186 [0]
float add/sub: 2.688253 [0]
float mul/div: 4.683886 [0]
double add/sub: 2.700834 [0]
double mul/div: 4.646755 [0]

As Dan pointed out, even once you normalize for clock frequency (which can be misleading in itself in pipelined designs), results will vary wildly based on CPU architecture: individual ALU/FPU performance, as well as the actual number of ALUs/FPUs available per core, which in superscalar designs influences how many independent operations can execute in parallel. The latter factor is not exercised by the code below, as all of its operations are sequentially dependent; a sketch that does exercise it follows the benchmark.

Poor man's FPU/ALU operation benchmark:

#include <stdio.h>
#ifdef _WIN32
#include <sys/timeb.h>
#else
#include <sys/time.h>
#endif
#include <time.h>
#include <cstdlib>

double
mygettime(void) {
# ifdef _WIN32
  struct _timeb tb;
  _ftime(&tb);
  return (double)tb.time + (0.001 * (double)tb.millitm);
# else
  struct timeval tv;
  if(gettimeofday(&tv, 0) < 0) {
    perror("oops");
  }
  return (double)tv.tv_sec + (0.000001 * (double)tv.tv_usec);
# endif
}

template< typename Type >
void my_test(const char* name) {
  Type v  = 0;
  // Do not use constants or repeating values
  //  to avoid loop unroll optimizations.
  // All values >0 to avoid division by 0
  // Perform ten ops/iteration to reduce
  //  impact of ++i below on measurements
  Type v0 = (Type)(rand() % 256)/16 + 1;
  Type v1 = (Type)(rand() % 256)/16 + 1;
  Type v2 = (Type)(rand() % 256)/16 + 1;
  Type v3 = (Type)(rand() % 256)/16 + 1;
  Type v4 = (Type)(rand() % 256)/16 + 1;
  Type v5 = (Type)(rand() % 256)/16 + 1;
  Type v6 = (Type)(rand() % 256)/16 + 1;
  Type v7 = (Type)(rand() % 256)/16 + 1;
  Type v8 = (Type)(rand() % 256)/16 + 1;
  Type v9 = (Type)(rand() % 256)/16 + 1;

  double t1 = mygettime();
  for (size_t i = 0; i < 100000000; ++i) {
    v += v0;
    v -= v1;
    v += v2;
    v -= v3;
    v += v4;
    v -= v5;
    v += v6;
    v -= v7;
    v += v8;
    v -= v9;
  }
  // Pretend we make use of v so compiler doesn't optimize out
  //  the loop completely
  printf("%s add/sub: %f [%d]\n", name, mygettime() - t1, (int)v&1);
  t1 = mygettime();
  for (size_t i = 0; i < 100000000; ++i) {
    v /= v0;
    v *= v1;
    v /= v2;
    v *= v3;
    v /= v4;
    v *= v5;
    v /= v6;
    v *= v7;
    v /= v8;
    v *= v9;
  }
  // Pretend we make use of v so compiler doesn't optimize out
  //  the loop completely
  printf("%s mul/div: %f [%d]\n", name, mygettime() - t1, (int)v&1);
}

int main() {
  my_test< short >("short");
  my_test< long >("long");
  my_test< long long >("long long");
  my_test< float >("float");
  my_test< double >("double");

  return 0;
}
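
To exercise that superscalar factor, one can compare a single dependent chain against several independent ones. A hedged sketch of the idea (the chain count and iteration count are illustrative, not part of the benchmark above):

#include <chrono>
#include <cstdio>

// One dependency chain: every add waits on the previous result,
// so the loop runs at FP-add latency.
static float one_chain(float x, long n)
{
    float a = 0.0f;
    for (long i = 0; i < n; ++i)
        a += x;
    return a;
}

// Four independent chains doing the same total work: a superscalar,
// pipelined FPU can overlap them, moving toward FP-add throughput.
static float four_chains(float x, long n)
{
    float a = 0.0f, b = 0.0f, c = 0.0f, d = 0.0f;
    for (long i = 0; i < n; i += 4) {
        a += x; b += x; c += x; d += x;
    }
    return (a + b) + (c + d);
}

int main()
{
    const long n = 400000000L;
    volatile float x = 1.0f; // volatile read defeats constant folding
    auto t0 = std::chrono::steady_clock::now();
    volatile float r1 = one_chain(x, n);
    auto t1 = std::chrono::steady_clock::now();
    volatile float r2 = four_chains(x, n);
    auto t2 = std::chrono::steady_clock::now();
    std::printf("1 chain: %f s   4 chains: %f s   [%f %f]\n",
                std::chrono::duration<double>(t1 - t0).count(),
                std::chrono::duration<double>(t2 - t1).count(),
                (double)r1, (double)r2);
    return 0;
}
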
猫瑾少女 2024-09-03 08:34:59

TIL this varies (a lot). Here are some results using the GNU compiler (by the way, I also checked by compiling on both machines; GNU g++ 5.4 from xenial is a hell of a lot faster than 4.6.3 from linaro on precise)

Intel i7 4700MQ xenial

short add: 0.822491
short sub: 0.832757
short mul: 1.007533
short div: 3.459642
long add: 0.824088
long sub: 0.867495
long mul: 1.017164
long div: 5.662498
long long add: 0.873705
long long sub: 0.873177
long long mul: 1.019648
long long div: 5.657374
float add: 1.137084
float sub: 1.140690
float mul: 1.410767
float div: 2.093982
double add: 1.139156
double sub: 1.146221
double mul: 1.405541
double div: 2.093173

Intel i3 2370M has similar results

short add: 1.369983
short sub: 1.235122
short mul: 1.345993
short div: 4.198790
long add: 1.224552
long sub: 1.223314
long mul: 1.346309
long div: 7.275912
long long add: 1.235526
long long sub: 1.223865
long long mul: 1.346409
long long div: 7.271491
float add: 1.507352
float sub: 1.506573
float mul: 2.006751
float div: 2.762262
double add: 1.507561
double sub: 1.506817
double mul: 1.843164
double div: 2.877484

Intel(R) Celeron(R) 2955U (Acer C720 Chromebook running xenial)

short add: 1.999639
short sub: 1.919501
short mul: 2.292759
short div: 7.801453
long add: 1.987842
long sub: 1.933746
long mul: 2.292715
long div: 12.797286
long long add: 1.920429
long long sub: 1.987339
long long mul: 2.292952
long long div: 12.795385
float add: 2.580141
float sub: 2.579344
float mul: 3.152459
float div: 4.716983
double add: 2.579279
double sub: 2.579290
double mul: 3.152649
double div: 4.691226

DigitalOcean 1GB Droplet Intel(R) Xeon(R) CPU E5-2630L v2 (running trusty)

short add: 1.094323
short sub: 1.095886
short mul: 1.356369
short div: 4.256722
long add: 1.111328
long sub: 1.079420
long mul: 1.356105
long div: 7.422517
long long add: 1.057854
long long sub: 1.099414
long long mul: 1.368913
long long div: 7.424180
float add: 1.516550
float sub: 1.544005
float mul: 1.879592
float div: 2.798318
double add: 1.534624
double sub: 1.533405
double mul: 1.866442
double div: 2.777649

AMD Opteron(tm) Processor 4122 (precise)

short add: 3.396932
short sub: 3.530665
short mul: 3.524118
short div: 15.226630
long add: 3.522978
long sub: 3.439746
long mul: 5.051004
long div: 15.125845
long long add: 4.008773
long long sub: 4.138124
long long mul: 5.090263
long long div: 14.769520
float add: 6.357209
float sub: 6.393084
float mul: 6.303037
float div: 17.541792
double add: 6.415921
double sub: 6.342832
double mul: 6.321899
double div: 15.362536

This uses code from http://pastebin.com/Kx8WGUfg as benchmark-pc.c

g++ -fpermissive -O3 -o benchmark-pc benchmark-pc.c

I've run multiple passes, and the general numbers seem to stay the same.

One notable exception seems to be ALU mul vs FPU mul; addition and subtraction seem only trivially different.

Here is the above in chart form (lower is faster and preferable):

[Chart of the above data]

Update to accommodate @Peter Cordes

https://gist.github.com/Lewiscowles1986/90191c59c9aedf3d08bf0b129065cccc

i7 4700MQ Linux Ubuntu Xenial 64-bit (all patches to 2018-03-13 applied)

    short add: 0.773049
    short sub: 0.789793
    short mul: 0.960152
    short div: 3.273668
      int add: 0.837695
      int sub: 0.804066
      int mul: 0.960840
      int div: 3.281113
     long add: 0.829946
     long sub: 0.829168
     long mul: 0.960717
     long div: 5.363420
long long add: 0.828654
long long sub: 0.805897
long long mul: 0.964164
long long div: 5.359342
    float add: 1.081649
    float sub: 1.080351
    float mul: 1.323401
    float div: 1.984582
   double add: 1.081079
   double sub: 1.082572
   double mul: 1.323857
   double div: 1.968488

AMD Opteron(tm) Processor 4122 (precise, DreamHost shared-hosting)

    short add: 1.235603
    short sub: 1.235017
    short mul: 1.280661
    short div: 5.535520
      int add: 1.233110
      int sub: 1.232561
      int mul: 1.280593
      int div: 5.350998
     long add: 1.281022
     long sub: 1.251045
     long mul: 1.834241
     long div: 5.350325
long long add: 1.279738
long long sub: 1.249189
long long mul: 1.841852
long long div: 5.351960
    float add: 2.307852
    float sub: 2.305122
    float mul: 2.298346
    float div: 4.833562
   double add: 2.305454
   double sub: 2.307195
   double mul: 2.302797
   double div: 5.485736

Intel Xeon E5-2630L v2 @ 2.4GHz (Trusty 64-bit, DigitalOcean VPS)

    short add: 1.040745
    short sub: 0.998255
    short mul: 1.240751
    short div: 3.900671
      int add: 1.054430
      int sub: 1.000328
      int mul: 1.250496
      int div: 3.904415
     long add: 0.995786
     long sub: 1.021743
     long mul: 1.335557
     long div: 7.693886
long long add: 1.139643
long long sub: 1.103039
long long mul: 1.409939
long long div: 7.652080
    float add: 1.572640
    float sub: 1.532714
    float mul: 1.864489
    float div: 2.825330
   double add: 1.535827
   double sub: 1.535055
   double mul: 1.881584
   double div: 2.777245

Apple Mac Mini M1

    short add: 0.794701
    short sub: 0.752165
    short mul: 1.002816
    short div: 1.510412
     long add: 0.704235
     long sub: 0.704065
     long mul: 0.891701
     long div: 1.391481
long long add: 0.703971
long long sub: 0.704361
long long mul: 0.890722
long long div: 1.392378
    float add: 1.376483
    float sub: 1.377145
    float mul: 1.377523
    float div: 1.754344
   double add: 1.378830
   double sub: 1.380009
   double mul: 1.378437
   double div: 2.005511

Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz

    short add: 0.625791
    short sub: 0.612076
    short mul: 0.808043
    short div: 3.223206
     long add: 0.598402
     long sub: 0.594910
     long mul: 0.783385
     long div: 4.568725
long long add: 0.594657
long long sub: 0.597185
long long mul: 0.778999
long long div: 4.467567
    float add: 0.972729
    float sub: 0.963480
    float mul: 0.968124
    float div: 1.767378
   double add: 0.973561
   double sub: 0.968600
   double mul: 0.976119
   double div: 1.967776
千鲤 2024-09-03 08:34:59

Alas, I can only give you an "it depends" answer...

From my experience, there are many, many variables to performance...especially between integer & floating point math. It varies strongly from processor to processor (even within the same family such as x86) because different processors have different "pipeline" lengths. Also, some operations are generally very simple (such as addition) and have an accelerated route through the processor, and others (such as division) take much, much longer.

The other big variable is where the data reside. If you only have a few values to add, then all of the data can reside in cache, where they can be quickly sent to the CPU. A very, very slow floating point operation that already has the data in cache will be many times faster than an integer operation where an integer needs to be copied from system memory.

I assume that you are asking this question because you are working on a performance critical application. If you are developing for the x86 architecture, and you need extra performance, you might want to look into using the SSE extensions. This can greatly speed up single-precision floating point arithmetic, as the same operation can be performed on multiple data at once, plus there is a separate* bank of registers for the SSE operations. (I noticed in your second example you used "float" instead of "double", making me think you are using single-precision math).

*Note: Using the old MMX instructions would actually slow down programs, because those old instructions actually used the same registers as the FPU does, making it impossible to use both the FPU and MMX at the same time.

下雨或天晴 2024-09-03 08:34:59

There is likely to be a significant difference in real-world speed between fixed-point and floating-point math, but the theoretical best-case throughput of the ALU vs. the FPU is completely irrelevant. Instead, what dominates: the number of integer and floating-point registers (real registers, not register names) on your architecture that are not otherwise used by your computation (e.g. for loop control); the number of elements of each type that fit in a cache line; and the optimizations possible given the different semantics of integer vs. floating-point math. The data dependencies of your algorithm play a significant role here, so no general comparison will predict the performance gap on your problem.

For example, integer addition is commutative and associative, so if the compiler sees a loop like the one you used for a benchmark (assuming the random data was prepared in advance so it wouldn't obscure the results), it can unroll the loop and calculate partial sums with no dependencies, then add them up when the loop terminates. But with floating point, the compiler has to do the operations in the same order you requested (you've got sequence points in there, so the compiler has to guarantee the same result, which disallows reordering), so each addition depends strongly on the result of the previous one; a sketch of that transformation follows.
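
A hand-written illustration of the transformation (a sketch: for int the compiler is allowed to do this itself, while doing it for float changes the rounding, which is why it normally requires an explicit opt-in such as -ffast-math):

#include <cstddef>

// What you write: one sequential chain of additions.
int sum_simple(const int* a, std::size_t n)
{
    int s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// What the compiler may turn it into for integers: four independent
// partial sums that can execute in parallel, combined after the loop.
// For float, this reassociation would change the result slightly.
int sum_unrolled(const int* a, std::size_t n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    int s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) // leftover elements
        s += a[i];
    return s;
}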

You're likely to fit more integer operands in cache at a time as well. So the fixed-point version might outperform the float version by an order of magnitude even on a machine where the FPU has theoretically higher throughput.

为人所爱 2024-09-03 08:34:59

Addition is much faster than rand, so your program is (especially) useless: it mostly measures the cost of rand().

You need to identify performance hotspots and incrementally modify your program. It sounds like you have problems with your development environment that will need to be solved first. Is it impossible to run your program on your PC for a small problem set?

Generally, attempting FP jobs with integer arithmetic is a recipe for slow code.

森罗 2024-09-03 08:34:59

Two points to consider -

Modern hardware can overlap instructions, execute them in parallel, and reorder them to make the best use of the hardware. Also, any significant floating point program is likely to have significant integer work too, even if it is only calculating indices into arrays, loop counters, etc., so even if you have a slow floating point instruction, it may well be running on a separate bit of hardware, overlapped with some of the integer work. My point is that even if the floating point instructions are slower than the integer ones, your overall program may run faster because it can make use of more of the hardware.

As always, the only way to be sure is to profile your actual program.

The second point is that most CPUs these days have SIMD instructions for floating point that can operate on multiple floating point values at the same time. For example, you can load 4 floats into a single SSE register and then perform 4 multiplications on them all in parallel; a sketch follows. If you can rewrite parts of your code to use SSE instructions, it seems likely it will be faster than an integer version. Visual C++ provides compiler intrinsic functions to do this; see http://msdn.microsoft.com/en-us/library/x5c07e2a(v=VS.80).aspx for some information.
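
A minimal sketch of that idea with SSE intrinsics from <immintrin.h> (available in GCC, Clang and MSVC; the array contents here are arbitrary):

#include <immintrin.h>
#include <cstdio>

int main()
{
    // Four packed single-precision floats per 128-bit SSE register.
    alignas(16) float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    alignas(16) float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    alignas(16) float r[4];

    __m128 va = _mm_load_ps(a);     // load 4 floats in one instruction
    __m128 vb = _mm_load_ps(b);
    __m128 vr = _mm_mul_ps(va, vb); // 4 multiplications in parallel
    _mm_store_ps(r, vr);

    std::printf("%f %f %f %f\n", r[0], r[1], r[2], r[3]);
    return 0;
}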

逆光下的微笑 2024-09-03 08:34:59

Unless you're writing code that will be called millions of times per second (such as, e.g., drawing a line to the screen in a graphics application), integer vs. floating-point arithmetic is rarely the bottleneck.

The usual first step for efficiency questions is to profile your code to see where the run time is really spent. The Linux command for this is gprof.

Edit:

Though I suppose you could always implement a line-drawing algorithm once with integers and once with floating point, call each a large number of times, and see if it makes a difference (a compact integer version is sketched after the link):

http://en.wikipedia.org/wiki/Bresenham's_algorithm
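
For reference, a compact integer-only core of the algorithm linked above (a sketch; plot() is a hypothetical stand-in for whatever pixel sink the benchmark writes to):

#include <cstdlib>

// Hypothetical pixel sink; in a real benchmark make sure the compiler
// cannot optimize these writes away (e.g. write to a real framebuffer).
void plot(int x, int y);

// Classic integer-only Bresenham: additions and comparisons per pixel,
// no floating point anywhere.
void bresenham(int x0, int y0, int x1, int y1)
{
    int dx = std::abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -std::abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy; // accumulated error term
    for (;;) {
        plot(x0, y0);
        if (x0 == x1 && y0 == y1)
            break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; } // step in x
        if (e2 <= dx) { err += dx; y0 += sy; } // step in y
    }
}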

给妤﹃绝世温柔 2024-09-03 08:34:59

The floating point version would be much slower if there were no remainder operation. Since all the adds are sequential, the CPU will not be able to parallelise the summation, so latency will be critical. FPU add latency is typically 3 cycles, while integer add is 1 cycle. However, the divider used for the remainder operator will probably be the critical part, as it is not fully pipelined on modern CPUs. So, assuming the divide/remainder instruction consumes the bulk of the time, the difference due to add latency will be small.

巡山小妖精 2024-09-03 08:34:59

Today, integer operations are usually a little bit faster than floating point operations. So if you can do a calculation with the same operations in integer and floating point, use integer. HOWEVER, you are saying "This causes a whole lot of annoying problems and adds a lot of annoying code". That sounds like you need more operations because you use integer arithmetic instead of floating point. In that case, floating point will run faster because:

  • as soon as you need more integer operations, you probably need a lot more, so the slight speed advantage is more than eaten up by the additional operations

  • the floating-point code is simpler, which means it is faster to write the code, which means that if it is speed critical, you can spend more time optimising the code.

要走就滚别墨迹 2024-09-03 08:34:59

I ran a test that just added 1 to the number instead of rand(). Results (on an x86-64) were as follows (a sketch of the loop follows the list):

  • short: 4.260s
  • int: 4.020s
  • long long: 3.350s
  • float: 7.330s
  • double: 7.210s
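
A hedged reconstruction of the kind of loop that was timed (not the poster's exact code; the volatile read is there so the compiler cannot fold the integer loops into a single multiply):

#include <iostream>

// Same shape as the question's loops, but adding 1 instead of rand().
template <typename T>
T accumulate_ones(unsigned long iters)
{
    T accum = 0;
    volatile T one = 1; // re-read each iteration; defeats constant folding
    for (unsigned long i = 0; i < iters; ++i)
        accum += one;
    return accum;
}

int main()
{
    // Note: float saturates at 2^24 when repeatedly adding 1, which is
    // irrelevant for timing but explains a "wrong"-looking result.
    std::cout << accumulate_ones<short>(100000000UL) << "\n"
              << accumulate_ones<int>(100000000UL) << "\n"
              << accumulate_ones<long long>(100000000UL) << "\n"
              << accumulate_ones<float>(100000000UL) << "\n"
              << accumulate_ones<double>(100000000UL) << std::endl;
    return 0;
}
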
西瑶 2024-09-03 08:34:59

Based on that oh-so-reliable "something I've heard": back in the old days, integer calculations were about 20 to 50 times faster than floating point, and these days they are less than twice as fast.
