Stack performance in programming languages

Posted 2024-09-30


Just for fun, I tried to compare the stack performance of a couple of programming languages calculating the Fibonacci series using the naive recursive algorithm. The code is basically the same in all languages; I'll post the Java version:

public class Fib {
 public static int fib(int n) {
  if (n < 2) return 1;
  return fib(n-1) + fib(n-2);
 }

 public static void main(String[] args) {
  System.out.println(fib(Integer.valueOf(args[0])));
 }
}

Ok so the point is that using this algorithm with input 40 I got these timings:

C: 2.796s
OCaml: 2.372s
Python: 106.407s
Java: 1.336s
C# (Mono): 2.956s

They were taken on an Ubuntu 10.04 box, using the version of each language available in the official repositories, on a dual-core Intel machine.

I know that functional languages like OCaml have some slowdown that comes from treating functions as first-class citizens, and I have no problem explaining CPython's running time, since it's the only interpreted language in this test, but I was impressed by the Java running time, which is half that of C for the same algorithm! Would you attribute this to JIT compilation?

How would you explain these results?

EDIT: thank you for the interesting replies! I recognize that this is not a proper benchmark (never said it was :P) and maybe I can make a better one and post it to you next time, in the light of what we've discussed :)

EDIT 2: I updated the runtime of the ocaml implementation, using the optimizing compiler ocamlopt. Also I published the testbed at https://github.com/hoheinzollern/fib-test. Feel free to make additions to it if you want :)

8 Answers

一袭白衣梦中忆 2024-10-07 17:29:46


You might want to crank up the optimisation level of your C compiler. With gcc -O3, that makes a big difference, a drop from 2.015 seconds to 0.766 seconds, a reduction of about 62%.

Beyond that, you need to ensure you've tested correctly. You should run each program ten times, remove the outliers (fastest and slowest), then average the other eight.

In addition, make sure you're measuring CPU time rather than clock time.

Anything less than that I would not consider a decent statistical analysis; it may well be subject to noise, rendering your results useless.

For what it's worth, those C timings above were for seven runs with the outliers taken out before averaging.


In fact, this question shows how important algorithm selection is when aiming for high performance. Although recursive solutions are usually elegant, this one suffers from the fact that it duplicates a lot of calculations. The iterative version:

int fib(unsigned int n) {
    int t, a, b;
    if (n < 2) return 1;
    a = b = 1;
    while (n-- >= 2) {
        t = a + b;
        a = b;
        b = t;
    }
    return b;
}

further drops the time taken, from 0.766 seconds to 0.078 seconds, a further reduction of 89% and a whopping reduction of 96% from the original code.


And, as a final attempt, you should try out the following, which combines a lookup table with calculations beyond a certain point:

static int fib(unsigned int n) {
    static int lookup[] = {
        1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377,
        610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657,
        46368, 75025, 121393, 196418, 317811, 514229, 832040,
        1346269, 2178309, 3524578, 5702887, 9227465, 14930352,
        24157817, 39088169, 63245986, 102334155, 165580141 };
    int t, a, b;

    if (n < sizeof(lookup)/sizeof(*lookup))
        return lookup[n];
    a = lookup[sizeof(lookup)/sizeof(*lookup)-2];
    b = lookup[sizeof(lookup)/sizeof(*lookup)-1];
    while (n-- >= sizeof(lookup)/sizeof(*lookup)) {
        t = a + b;
        a = b;
        b = t;
    }

    return b;
}

That reduces the time yet again but I suspect we're hitting the point of diminishing returns here.

染柒℉ 2024-10-07 17:29:46


You say very little about your configuration (in benchmarking, details are everything: command lines, machine used, ...)

When I try to reproduce for OCaml I get:

let rec f n = if n < 2 then 1 else (f (n-1)) + (f (n-2))

let () = Format.printf "%d@." (f 40)


$ ocamlopt fib.ml
$ time ./a.out 
165580141

real    0m1.643s

This is on an Intel Xeon 5150 (Core 2) at 2.66GHz. If I use the bytecode OCaml compiler ocamlc on the other hand, I get a time similar to your result (11s). But of course, for running a speed comparison, there is no reason to use the bytecode compiler, unless you want to benchmark the speed of compilation itself (ocamlc is amazing for speed of compilation).

记忆里有你的影子 2024-10-07 17:29:46


One possibility is that the C compiler is optimizing on the guess that the first branch (n < 2) is the one more frequently taken. It has to do that purely at compile time: make a guess and stick with it.

Hotspot gets to run the code, see what actually happens more often, and reoptimize based on that data.

You may be able to see a difference by inverting the logic of the if:

public static int fib(int n) {
 if (n >= 2) return fib(n-1) + fib(n-2);
 return 1;
}

It's worth a try, anyway :)

As always, check the optimization settings for all platforms, too. Obviously the compiler settings for C - and on Java, try using the client version of Hotspot vs the server version. (Note that you need to run for longer than a second or so to really get the full benefit of Hotspot... it might be interesting to put the outer call in a loop to get runs of a minute or so.)

述情 2024-10-07 17:29:46


I can explain the Python performance. Python's performance for recursion is abysmal at best, and it should be avoided like the plague when coding in it. Especially since stack overflow occurs by default at a recursion depth of only 1000...

As for Java's performance, that's amazing. It's rare that Java beats C (even with very little compiler optimization on the C side)... what the JIT might be doing is memoization or tail recursion...

暖心男生 2024-10-07 17:29:46


Note that if the Java Hotspot VM is smart enough to memoise fib() calls, it can cut the exponential cost of the algorithm down to something nearer to linear cost.
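For illustration, a hand-written memoised version (a sketch of the transformation; not a claim about what Hotspot actually does) shows how caching turns the exponential recursion into a linear one:

```c
/* Memoised Fibonacci: cache[i] stores fib(i) once computed, so each
 * value is calculated exactly once and the cost is linear in n.
 * Valid for 0 <= n <= 45; fib(45) is the largest value in this
 * convention (fib(0) = fib(1) = 1) that fits in a signed 32-bit int. */
int fib_memo(int n) {
    static int cache[46];
    if (n < 2) return 1;
    if (cache[n] != 0) return cache[n];
    return cache[n] = fib_memo(n - 1) + fib_memo(n - 2);
}
```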

昇り龍 2024-10-07 17:29:46


I wrote a C version of the naive Fibonacci function and compiled it to assembler in gcc (4.3.2 Linux). I then compiled it with gcc -O3.

The unoptimised version was 34 lines long and looked like a straight translation of the C code.

The optimised version was 190 lines long and (it was difficult to tell but) it appeared to inline at least the calls for values up to about 5.

烟织青萝梦 2024-10-07 17:29:46


With C, you should either declare the Fibonacci function "inline" or, with gcc, add the -finline-functions argument to the compile options. That will allow the compiler to do recursive inlining. That's also why you get better performance with -O3: it activates -finline-functions.

Edit: You need to specify at least -O/-O1 to get recursive inlining, even if the function is declared inline. Actually, testing this myself, I found that declaring the function inline and using -O as the compilation flag, or just using -O -finline-functions, made my recursive Fibonacci code faster than -O2 or -O2 -finline-functions.

水染的天色ゝ 2024-10-07 17:29:46


One C trick you can try is to disable stack checking (i.e. the built-in code which makes sure the stack is large enough to permit the additional allocation of the current function's local variables). This could be dicey for a recursive function and indeed could be the reason behind the slow C times: the executing program might well have run out of stack space, which forces the stack checking to reallocate the entire stack several times during the actual run.

Try to approximate the stack size you need and force the linker to allocate that much stack space. Then disable stack-checking and re-make the program.
