Improving Haskell performance compared to C in a Fibonacci micro-benchmark

I came across this question, which compared the performance of various compilers on computing Fibonacci numbers the naive way.

I tried doing this with Haskell to see how it compares to C.

C code:

#include <stdio.h>
#include <stdlib.h>

int fib (int n) {
  if (n < 2) return 1;
  return fib (n-1) + fib (n-2);
}

int main (int argc, char* argv[]) {
  printf ("%i\n", fib (atoi(argv[1])));
  return 0;
}

Result:

> gcc -O3 main.c -o fib
> time ./fib 40
165580141
real    0m0.421s
user    0m0.420s
sys 0m0.000s

Haskell:

module Main where
import System.Environment (getArgs)

fib :: Int -> Int
fib n | n < 2 = 1
      | otherwise = fib (n-1) + fib (n-2)

main = getArgs >>= print . fib . read . head

Result:

> ghc -O3 -fllvm -optlo-O3 Main.hs -o fib
> time ./fib 40
165580141
real    0m1.476s
user    0m1.476s
sys 0m0.000s

Profiling with

> ghc -O3 -fllvm -optlo-O3 -prof -auto-all -caf-all -rtsopts Main.hs -fforce-recomp -o fib
> ./fib 40 +RTS -prof

reveals that fib takes 100% of the time and allocation, no surprise. I took some heap profiles, but don't know what they imply:

> ./fib 40 +RTS -hc

[heap profile from +RTS -hc; image not included]

> ./fib 40 +RTS -hd

[heap profile from +RTS -hd; image not included]

So my question: Is there anything I can do on my side to bring this Haskell program's performance closer to C, or is this just the way GHC does things, which happens to make it slower in this micro-benchmark? (I'm not asking for an asymptotically faster algorithm to compute fibs.)

Thank you very much.

[EDIT]

It turned out that ghc -O3 was faster than ghc -O3 -fllvm -optlo-O3 in this case. But -optlo-block-placement made an observable difference for the LLVM backend:

> ghc -O3 Main.hs -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.283s
user    0m1.284s
sys 0m0.000s

> ghc -O3 -fllvm -optlo-O3 -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.449s
user    0m1.448s
sys 0m0.000s

> ghc -O3 -fllvm -optlo-O3 -optlo-block-placement -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.112s
user    0m1.096s
sys 0m0.016s

The reason I wanted to investigate this was that both C and OCaml were significantly faster than Haskell for this program. I sort of couldn't accept that and wanted to learn more to make sure I had already done everything I could :D

> ocamlopt main.ml -o fib
> time ./fib 40
165580141
real    0m0.668s
user    0m0.660s
sys 0m0.008s

Comments (4)

雨落星ぅ辰 2024-12-01 06:40:12

The heap profile is not very interesting here, as GHC compiles fib into a function that operates solely on the stack. Just look at the profile... Only 800 bytes are ever allocated, the small overhead of your main implementation.

As far as GHC's core-level is concerned, this actually gets optimized as far as it can get. The low-level code generation is another matter though. Let us have a quick dive into the code GHC generates:

_Main_zdwfib_info:
.Lc1RK:
    leal -8(%ebp),%eax
    cmpl 84(%ebx),%eax
    jb .Lc1RM

This is the check for stack space. Probably something C doesn't need, as it can let the operating system handle stack space allocation. Haskell has user-level threads, so stack space is managed manually.

    cmpl $2,0(%ebp)
    jl .Lc1RO

The comparison against 2 from your code.

    movl 0(%ebp),%eax
    decl %eax

The parameter is reloaded from the stack and decremented to get the parameter for the recursive call. The reload is probably unnecessary - not sure it makes a difference though.

    movl %eax,-8(%ebp)
    movl $_s1Qp_info,-4(%ebp)
    addl $-8,%ebp
    jmp _Main_zdwfib_info

The parameter and the return address are pushed on top of the stack, and we jump directly to the label in order to recurse.

.Lc1RO:
    movl $1,%esi
    addl $4,%ebp
    jmp *0(%ebp)

The code for the case that the parameter is less than 2. The return value is passed in a register.

Bottom line: Everything seems to be working as it should; it's unlikely you will be able to squeeze much more out of this by changing the program. The custom stack check is an obvious source of slowdown, though it's not clear whether it can be blamed for the full time difference.
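
As a rough source-level picture of what that assembly corresponds to, here is a hand-written sketch of the worker/wrapper split GHC performs on fib (an illustration of the idea, written against GHC.Exts with the MagicHash extension; it is not the code GHC emits, and the wfib name merely stands in for the generated $wfib worker behind _Main_zdwfib_info):

{-# LANGUAGE MagicHash #-}
module Main where

import GHC.Exts (Int(I#), Int#, isTrue#, (+#), (-#), (<#))
import System.Environment (getArgs)

-- Worker: recurses on raw, unboxed Int# values, so the hot path never
-- allocates boxed integers, consistent with the flat heap profile.
wfib :: Int# -> Int#
wfib n
  | isTrue# (n <# 2#) = 1#
  | otherwise         = wfib (n -# 1#) +# wfib (n -# 2#)

-- Wrapper: unboxes the argument and re-boxes the result.
fib :: Int -> Int
fib (I# n) = I# (wfib n)

main :: IO ()
main = getArgs >>= print . fib . read . head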

我的影子我的梦 2024-12-01 06:40:12

This seems like a really feeble 'benchmark', as barsoap says. Suppose I compare the following almost equally naive programs:

module Main where
import System.Environment (getArgs)

fib ::  Int ->  Int
fib 0 = 1
fib 1 = 1
fib 2 = 2 
fib n = (\x y -> x + y + y )  (fib (n-3))  (fib (n-2) )

main = getArgs >>= print . fib . read . head  

... and in the other corner ...

#include <stdio.h>
#include <stdlib.h>

int fib (int n) {
  if (n < 2) return 1;
  if (n < 3) return n;
  return (fib (n-3) + fib (n-2)) + fib (n-2);
}

int main (int argc, char* argv[]) {
  printf ("%i\n", fib (atoi(argv[1])));
  return 0;
}

Then the Glorious ghc crushes gcc, not too surprisingly, really:

$ ghc --make -fforce-recomp fib.hs -o fibh
[1 of 1] Compiling Main             ( fib.hs, fib.o )
Linking fibh ...
$ time ./fibh 40
165580141

real    0m0.013s
user    0m0.007s
sys 0m0.003s

$ gcc fib.c -o fibc
$ time ./fibc 40
165580141

real    0m1.495s
user    0m1.483s
sys 0m0.003s

now turning on optimizations, ghc picks up a little more speed:

$ ghc --make -fforce-recomp fib.hs -O3 -o fibhO
$ time ./fibhO 40
165580141

real    0m0.007s
user    0m0.002s
sys 0m0.004s

but gcc finally gets a clue.

$ gcc fib.c -O3 -o fibcO
$ time ./fibcO 40
165580141

real    0m0.007s
user    0m0.004s
sys 0m0.002s

I think the explanation is ghc's wariness of common subexpression elimination: it is dangerous where '(almost) everything is an expression', and it figures the programmer knows how to use a lambda.
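
For what it's worth, the sharing that the lambda spells out can just as well be written with a let binding; a small sketch (same behaviour as the lambda version above, only a different way of making the shared fib (n-2) explicit):

fib :: Int -> Int
fib 0 = 1
fib 1 = 1
fib 2 = 2
fib n = let y = fib (n-2)
        in  fib (n-3) + y + y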

风月客 2024-12-01 06:40:12

GHC compiles this fine. The next step is to micro-optimize the output from the backend of GHC. Playing with various LLVM flags can help here.

To do this, use ghc-core to inspect the assembly, and try other flags to LLVM to see what you get.

Another approach would be to add a small amount of parallelism.
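
As a concrete illustration of that last suggestion, here is a minimal sketch using par and pseq from Control.Parallel (the parallel package). The pfib helper, its depth cutoff, and the value 5 are my own choices rather than anything from the original answer, and the program would need to be built with -threaded -rtsopts and run with +RTS -N to use more than one core:

module Main where

import Control.Parallel (par, pseq)
import System.Environment (getArgs)

-- Plain sequential fib, used below the cutoff.
fib :: Int -> Int
fib n | n < 2     = 1
      | otherwise = fib (n-1) + fib (n-2)

-- Spark one branch of the recursion until the cutoff depth runs out,
-- then fall back to the sequential version to keep spark overhead small.
pfib :: Int -> Int -> Int
pfib d n
  | n < 2     = 1
  | d <= 0    = fib n
  | otherwise = x `par` (y `pseq` (x + y))
  where
    x = pfib (d-1) (n-1)
    y = pfib (d-1) (n-2)

main :: IO ()
main = getArgs >>= print . pfib 5 . read . head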

过去的过去 2024-12-01 06:40:12

Try this:

fibs :: [Integer]
fibs = 0:1:zipWith (+) fibs (tail fibs)

-- drop-in replacement for the recursive fib in the question's Main
-- (note the result type is Integer rather than Int)
fib :: Int -> Integer
fib n = fibs !! n

$ time ./fib 10000
  33644[...]6875
  ./fib 10000  0.03s user 0.01s system 89% cpu 0.039 total

(That's on a good ole Athlon64 3200+)

The version you are using computes, for every n, both fib (n-1) and fib (n-2); that is, its running time grows exponentially. The version above is linear: each fib is only computed once. Despite what the non-Haskell programming hivemind seems to think, Haskell does not automatically memoise (which would, in general, be slower than good ole dynamic programming anyway).

There's an even faster version of Fibonacci (using maths tricks) on the Haskell Wiki.

Change the C version to be non-recursive, and my bet is that you'll see both Haskell and C having very similar performance. Tight loops are just easier to optimise.
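
To make that last point concrete on the Haskell side, here is a sketch of a tail-recursive loop with strict accumulators (my own variant, using BangPatterns and following the fibs convention above, where fib 0 = 0); GHC compiles this kind of function to a simple loop, much like an iterative C version:

{-# LANGUAGE BangPatterns #-}

-- Tail-recursive loop with strict accumulators: the Haskell analogue of
-- an iterative C loop, computing one Fibonacci number per step.
fibIter :: Int -> Integer
fibIter n = go n 0 1
  where
    go :: Int -> Integer -> Integer -> Integer
    go 0 a _   = a
    go k !a !b = go (k - 1) b (a + b)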
