Puzzling performance difference between ifort and gfortran
Recently, I read a post on Stack Overflow about finding integers that are perfect squares. As I wanted to play with this, I wrote the following small program:
PROGRAM PERFECT_SQUARE
  IMPLICIT NONE
  INTEGER*8 :: N, M, NTOT
  LOGICAL :: IS_SQUARE
  N=Z'D0B03602181'
  WRITE(*,*) IS_SQUARE(N)
  NTOT=0
  DO N=1,1000000000
    IF (IS_SQUARE(N)) THEN
      NTOT=NTOT+1
    END IF
  END DO
  WRITE(*,*) NTOT ! should find 31622 squares
END PROGRAM

LOGICAL FUNCTION IS_SQUARE(N)
  IMPLICIT NONE
  INTEGER*8 :: N, M
  ! check if negative
  IF (N.LT.0) THEN
    IS_SQUARE=.FALSE.
    RETURN
  END IF
  ! check if ending 4 bits belong to (0,1,4,9)
  M=IAND(N,15)
  IF (.NOT.(M.EQ.0 .OR. M.EQ.1 .OR. M.EQ.4 .OR. M.EQ.9)) THEN
    IS_SQUARE=.FALSE.
    RETURN
  END IF
  ! try to find the nearest integer to sqrt(n)
  M=DINT(SQRT(DBLE(N)))
  IF (M**2.NE.N) THEN
    IS_SQUARE=.FALSE.
    RETURN
  END IF
  IS_SQUARE=.TRUE.
  RETURN
END FUNCTION
When compiling with gfortran -O2, the running time is 4.437 seconds; with -O3 it is 2.657 seconds. Then I thought that compiling with ifort -O2 could be faster since it might have a faster SQRT function, but it turned out the running time was now 9.026 seconds, and with ifort -O3 the same. I tried to analyze it using Valgrind, and the Intel-compiled program indeed uses many more instructions.
My question is why? Is there a way to find out where exactly the difference comes from?
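One way to narrow this down is to time the counting loop from inside the program, so that startup and I/O are excluded from the measurement. The following is only a minimal sketch: the module name SQUARE_TIMING and the subroutine TIMED_COUNT are hypothetical names introduced here, and moving IS_SQUARE into a module is an assumption that may change each compiler's inlining decisions compared with the external function above.

```fortran
! Sketch only: SQUARE_TIMING and TIMED_COUNT are hypothetical names,
! not part of the original post.
MODULE SQUARE_TIMING
  IMPLICIT NONE
CONTAINS
  ! Same test as in the question, restated as a module procedure.
  LOGICAL FUNCTION IS_SQUARE(N)
    INTEGER(8), INTENT(IN) :: N
    INTEGER(8) :: M
    IS_SQUARE = .FALSE.
    IF (N .LT. 0) RETURN
    ! low 4 bits of a square can only be 0, 1, 4 or 9
    M = IAND(N, 15_8)
    IF (.NOT. (M.EQ.0 .OR. M.EQ.1 .OR. M.EQ.4 .OR. M.EQ.9)) RETURN
    ! integer part of sqrt(n), then check by squaring
    M = DINT(SQRT(DBLE(N)))
    IS_SQUARE = (M*M .EQ. N)
  END FUNCTION IS_SQUARE

  ! Count the squares in 1..NMAX and report the wall-clock time of the loop.
  SUBROUTINE TIMED_COUNT(NMAX, NTOT, SECONDS)
    INTEGER(8), INTENT(IN) :: NMAX
    INTEGER(8), INTENT(OUT) :: NTOT
    DOUBLE PRECISION, INTENT(OUT) :: SECONDS
    INTEGER(8) :: N, T0, T1, RATE
    CALL SYSTEM_CLOCK(T0, RATE)
    NTOT = 0
    DO N = 1, NMAX
      IF (IS_SQUARE(N)) NTOT = NTOT + 1
    END DO
    CALL SYSTEM_CLOCK(T1)
    SECONDS = DBLE(T1 - T0) / DBLE(RATE)
  END SUBROUTINE TIMED_COUNT
END MODULE SQUARE_TIMING

PROGRAM TIME_PERFECT_SQUARE
  USE SQUARE_TIMING
  IMPLICIT NONE
  INTEGER(8) :: NTOT
  DOUBLE PRECISION :: SECONDS
  CALL TIMED_COUNT(1000000000_8, NTOT, SECONDS)
  WRITE(*,*) 'squares found:', NTOT
  WRITE(*,*) 'loop time (s):', SECONDS
END PROGRAM TIME_PERFECT_SQUARE
```

Building the same file with each compiler and comparing the reported loop time with the output of time ./a.out shows whether the gap sits in the loop itself. Generating assembly with the -S option of either compiler then allows a side-by-side look at the loop body.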
EDITS:
- gfortran version 4.6.2 and ifort version 12.0.2
- times are obtained from running time ./a.out and are the real/user time (sys was always almost 0)
- this is on Linux x86_64, both gfortran and ifort are 64-bit builds
- ifort inlines everything, gfortran only at -O3, but the latter's assembly code is simpler than that of ifort, which uses xmm registers a lot
- fixed a line of code: added NTOT=0 before the loop, which should fix the issue with other gfortran versions
When the complex IF statement is removed, gfortran takes about 4 times as long (10-11 seconds). This is to be expected, since the statement throws out about 75% of the numbers and so avoids computing the SQRT for them. On the other hand, ifort only uses slightly more time. My guess is that something goes wrong when ifort tries to optimize the IF statement.
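The 75% figure can be checked directly. Since (k+16)**2 and k**2 leave the same remainder modulo 16, it is enough to square k = 0..15 and record which values of the low 4 bits occur. The short program below is an illustrative sketch; the names SQUARE_RESIDUES, SEEN and NPASS are made up for this example.

```fortran
! Sketch only: enumerate the possible low 4 bits of a perfect square.
! SQUARE_RESIDUES, SEEN and NPASS are hypothetical names for this example.
PROGRAM SQUARE_RESIDUES
  IMPLICIT NONE
  LOGICAL :: SEEN(0:15)
  INTEGER :: K, NPASS
  SEEN = .FALSE.
  ! k and k+16 give the same square modulo 16, so 0..15 covers all cases
  DO K = 0, 15
    SEEN(MOD(K*K, 16)) = .TRUE.
  END DO
  NPASS = COUNT(SEEN)
  WRITE(*,*) 'possible low 4 bits of a square:'
  DO K = 0, 15
    IF (SEEN(K)) WRITE(*,*) K
  END DO
  WRITE(*,*) 'fraction rejected by the filter:', DBLE(16 - NPASS) / 16.0D0
END PROGRAM SQUARE_RESIDUES
```

Only 0, 1, 4 and 9 occur, so 12 of the 16 possible bit patterns are rejected before the SQRT is evaluated, which matches the roughly fourfold slowdown seen with gfortran when the test is removed.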
EDIT2:
I tried with ifort version 12.1.2.273; it's much faster, so it looks like they fixed that.
Comments (1)
What compiler versions are you using?
Interestingly, this looks like a performance regression from 11.1 to 12.0: for me, 11.1 (ifort -fast square.f90) takes 3.96s, and 12.0 (same options) takes 13.3s.
gfortran (4.6.1) (-O3) is still faster (3.35s).
I have seen this kind of a regression before, although not quite as dramatic.
BTW, replacing the if statement with
makes it run twice as fast with ifort 12.0, but slower in gfortran and ifort 11.1.
It looks like part of the problem is that 12.0 is overly aggressive in trying to vectorize things: adding
right before the DO loop (without changing anything else in the code) cuts the run time down to 4.0 sec.
Also, as a side benefit: if you have a multi-core CPU, try adding -parallel to the ifort command line :)