Improving Haskell's performance compared to C in a Fibonacci micro-benchmark
I came across this question, which compared the performance of various compilers on computing Fibonacci numbers the naive way.
I tried doing this with Haskell to see how it compares to C.
C code:
#include <stdio.h>
#include <stdlib.h>

int fib (int n) {
    if (n < 2) return 1;
    return fib (n-1) + fib (n-2);
}

int main (int argc, char* argv[]) {
    printf ("%i\n", fib (atoi(argv[1])));
    return 0;
}
Result:
> gcc -O3 main.c -o fib
> time ./fib 40
165580141
real 0m0.421s
user 0m0.420s
sys 0m0.000s
Haskell:
module Main where
import System.Environment (getArgs)
fib :: Int -> Int
fib n | n < 2     = 1
      | otherwise = fib (n-1) + fib (n-2)
main = getArgs >>= print . fib . read . head
Result:
> ghc -O3 -fllvm -optlo-O3 Main.hs -o fib
> time ./fib 40
165580141
real 0m1.476s
user 0m1.476s
sys 0m0.000s
Profiling with
> ghc -O3 -fllvm -optlo-O3 -prof -auto-all -caf-all -rtsopts Main.hs -fforce-recomp -o fib
> ./fib 40 +RTS -prof
reveals that fib takes 100% time and alloc, no surprise. I took some profiles of the heap, but don't know what they imply:
> ./fib 40 +RTS -hc
> ./fib 40 +RTS -hd
So my question: Is there anything I can do from my side to make this Haskell program's performance closer to C, or is this just the way GHC does things, which happens to make it slower in this micro-benchmark? (I'm not asking for an asymptotically faster algorithm to compute fibs.)
Thank you very much.
[EDIT]
It turned out that ghc -O3 was faster than ghc -O3 -fllvm -optlo-O3 in this case. But optlo-block-placement made an observable difference for the LLVM backend:
> ghc -O3 Main.hs -o fib -fforce-recomp
> time ./fib 40
165580141
real 0m1.283s
user 0m1.284s
sys 0m0.000s
> ghc -O3 -fllvm -optlo-O3 -o fib -fforce-recomp
> time ./fib 40
165580141
real 0m1.449s
user 0m1.448s
sys 0m0.000s
> ghc -O3 -fllvm -optlo-O3 -optlo-block-placement -o fib -fforce-recomp
> time ./fib 40
165580141
real 0m1.112s
user 0m1.096s
sys 0m0.016s
The reason I wanted to investigate this was because both C and OCaml were significantly faster than Haskell for this program. I sort of couldn't accept that and wanted to learn more to make sure I already did everything I could :D
> ocamlopt main.ml -o fib
> time ./fib 40
165580141
real 0m0.668s
user 0m0.660s
sys 0m0.008s
Answers (4)
The heap profile is not very interesting here, as GHC compiles fib into a function that operates solely on the stack. Just look at the profile... Only 800 bytes are ever allocated, the small overhead of your main implementation.
As far as GHC's core level is concerned, this actually gets optimized as far as it can get. The low-level code generation is another matter, though. Let us have a quick dive into the code GHC generates:
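You can look at this generated code yourself by asking GHC to dump it; for example:
> ghc -O2 -ddump-asm -fforce-recomp Main.hs
prints the assembly GHC produces, and -ddump-cmm shows the intermediate Cmm it is generated from.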
This is the check for stack space. Probably something C doesn't need, as it can let the operating system handle stack space allocation. Haskell has user-level threads, so stack space is managed manually.
The comparison against 2 from your code.
The parameter is reloaded from the stack and decremented to get the parameter for the recursive call. The reload is probably unnecessary - not sure it makes a difference though.
The parameter and the return address are pushed on top of the stack, and we jump directly to the label in order to recurse.
The code for the case that the parameter is less than 2. The return value is passed in a register.
Bottom line: Everything seems to be working like it should; it's unlikely you will be able to squeeze much more out of this by changing the program. The custom stack check is an obvious source of slowdowns, not sure whether it can be blamed for the full time difference though.
This seems like a really feeble 'benchmark', as barsoap says. Suppose I compare the following almost equally naive programs: ... and in the other corner ...
Then the Glorious ghc crushes gcc, not too surprisingly, really: ... Now turning on optimizations, ghc picks up a little more speed: ... but gcc finally gets a clue.
I think the explanation is ghc's wariness of common subexpression elimination: it is dangerous where '(almost) everything is an expression', and it figures the programmer knows how to use a lambda.
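To make that sharing point concrete, here is a hypothetical contrast (these are not the benchmark programs the answer compared; the names twice and shared are made up purely for illustration):

fib :: Int -> Int
fib n | n < 2     = 1
      | otherwise = fib (n-1) + fib (n-2)

-- 'twice' writes the same subexpression out twice; whether the two calls
-- get commoned up is left to the compiler, and as this answer suggests,
-- GHC is wary of doing that for you.
twice :: Int -> Int
twice n = fib (n-1) + fib (n-1)

-- 'shared' names the subexpression once, so it is evaluated only once
-- regardless of what the optimizer decides.
shared :: Int -> Int
shared n = let x = fib (n-1) in x + x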
GHC compiles this fine. The next step is to micro-optimize the output from the backend of GHC. Playing with various LLVM flags can help here.
To do this, use ghc-core to inspect the assembly, and try other flags to LLVM to see what you get.
Another approach would be to add a small amount of parallelism.
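A minimal sketch of that idea, assuming the parallel package's Control.Parallel (par, pseq) is available; the cutoff of 30 is arbitrary and untuned:

module Main where

import Control.Parallel (par, pseq)
import System.Environment (getArgs)

fib :: Int -> Int
fib n | n < 2     = 1
      | otherwise = fib (n-1) + fib (n-2)

-- Spark fib (n-1) in parallel while evaluating fib (n-2); below the
-- cutoff, fall back to the sequential version so that spark overhead
-- doesn't dominate the actual work.
pfib :: Int -> Int
pfib n
  | n < 30    = fib n
  | otherwise = x `par` (y `pseq` (x + y))
  where
    x = pfib (n-1)
    y = pfib (n-2)

main :: IO ()
main = getArgs >>= print . pfib . read . head

Built with something like ghc -O2 -threaded and run with +RTS -N, the two recursive calls can then be evaluated on separate cores.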
Try this:
(That's on a good ole Athlon64 3200+)
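A linear-time version of this kind can be written with the usual lazy, self-referential list; a minimal sketch (keeping the question's fib 0 = fib 1 = 1 convention, not necessarily the exact listing measured above):

module Main where

import System.Environment (getArgs)

-- Each element of the list is computed once and then shared, so looking
-- up index n costs a linear number of additions instead of the
-- exponential blow-up of the naive double recursion.
fibs :: [Int]
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)

main :: IO ()
main = getArgs >>= print . (fibs !!) . read . head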
The version you are using is, for every n, computing fib (n-1) and fib (n-2), that is, it has a complexity of roughly triangular nature. The version above is linear: each fib is only computed once. Despite what the non-Haskell programming hivemind seems to think, Haskell does not automatically memoise (which would, in general, be slower than good ole dynamic programming anyway). There's an even faster (using maths tricks) version of Fibonacci on the Haskell Wiki.
Change the C version to be non-recursive, and my bet is that you'll see both Haskell and C having very similar performance. Tight loops are just easier to optimise.