为什么此 Java 代码比相同的 C# 代码快 6 倍？

发布于 2024-11-06 05:27:49 字数 2283 浏览 5 评论 0原文

我对 Project Euler 问题 5 有几种不同的解决方案，但是在这个特定的实现中，两种语言/平台之间的执行时间差异引起了我的兴趣。我没有对编译器标志进行任何优化，只是简单的 javac （通过命令行）和 csc （通过 Visual Studio）。

这是 Java 代码。 55 毫秒内完成。

public class Problem005b
{
    public static void main(String[] args)
    {
        long begin = System.currentTimeMillis();
        int i = 20;
        while (true)
        {
            if (
                    (i % 19 == 0) &&
                    (i % 18 == 0) &&
                    (i % 17 == 0) &&
                    (i % 16 == 0) &&
                    (i % 15 == 0) &&
                    (i % 14 == 0) &&
                    (i % 13 == 0) &&
                    (i % 12 == 0) &&
                    (i % 11 == 0)
                )
            {
                break;
            }
            i += 20;
        }
        long end = System.currentTimeMillis();
        System.out.println(i);
        System.out.println(end-begin + "ms");
    }   
}

这是相同的 C# 代码。 320ms 内完成

using System;

namespace ProjectEuler05
{
    class Problem005
    {
        static void Main(String[] args)
        {
            DateTime begin = DateTime.Now;
            int i = 20;
            while (true)
            {
                if (
                        (i % 19 == 0) &&
                        (i % 18 == 0) &&
                        (i % 17 == 0) &&
                        (i % 16 == 0) &&
                        (i % 15 == 0) &&
                        (i % 14 == 0) &&
                        (i % 13 == 0) &&
                        (i % 12 == 0) &&
                        (i % 11 == 0)
                    )
                    {
                        break;
                    }
                i += 20;
            }
            DateTime end = DateTime.Now;
            TimeSpan elapsed = end - begin;
            Console.WriteLine(i);
            Console.WriteLine(elapsed.TotalMilliseconds + "ms");
        }
    }
}

原文

I have a few different solutions to Project Euler problem 5, but the execution time difference between the two languages/platforms in this particular implementation intrigues me. I didn't do any optimization with compiler flags, just plain javac (via commandline) and csc (via Visual Studio).

Here's the Java code. It finishes in 55ms.

public class Problem005b
{
    public static void main(String[] args)
    {
        long begin = System.currentTimeMillis();
        int i = 20;
        while (true)
        {
            if (
                    (i % 19 == 0) &&
                    (i % 18 == 0) &&
                    (i % 17 == 0) &&
                    (i % 16 == 0) &&
                    (i % 15 == 0) &&
                    (i % 14 == 0) &&
                    (i % 13 == 0) &&
                    (i % 12 == 0) &&
                    (i % 11 == 0)
                )
            {
                break;
            }
            i += 20;
        }
        long end = System.currentTimeMillis();
        System.out.println(i);
        System.out.println(end-begin + "ms");
    }   
}

Here is the identical C# code. It finishes in 320ms

using System;

namespace ProjectEuler05
{
    class Problem005
    {
        static void Main(String[] args)
        {
            DateTime begin = DateTime.Now;
            int i = 20;
            while (true)
            {
                if (
                        (i % 19 == 0) &&
                        (i % 18 == 0) &&
                        (i % 17 == 0) &&
                        (i % 16 == 0) &&
                        (i % 15 == 0) &&
                        (i % 14 == 0) &&
                        (i % 13 == 0) &&
                        (i % 12 == 0) &&
                        (i % 11 == 0)
                    )
                    {
                        break;
                    }
                i += 20;
            }
            DateTime end = DateTime.Now;
            TimeSpan elapsed = end - begin;
            Console.WriteLine(i);
            Console.WriteLine(elapsed.TotalMilliseconds + "ms");
        }
    }
}

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

殤城〤 2024-11-13 05:27:49

要计算代码执行时间，您应该使用 StopWatch类。
另外，您还必须考虑 JIT、运行时等，因此让测试运行足够的次数（例如 10,000、100,000 次）并获得某种平均值。重要的是多次运行代码，而不是程序。因此，编写一个方法，并在主方法中循环来获取测量值。
从程序集中删除所有调试内容，并让代码在发布版本中独立运行

回复收藏 0 原文

幻想少年梦 2024-11-13 05:27:49

有一些可能的优化。也许 Java JIT 正在执行它们，而 CLR 没有。

优化 #1：

(x % a == 0) && (x % b == 0) && ... && (x % z == 0)

相当于

(x % lcm(a, b, ... z) == 0)

因此在您的示例中，比较链可以替换为

if (i % 232792560 == 0) break;

(但是当然，如果您已经计算了 LCM，那么首先运行该程序就没有什么意义了！ )

优化#2：

这也是等价的：

if (i % (14549535 * 16)) == 0 break;

或者

if ((i % 16 == 0) && (i % 14549535 == 0)) break;

第一个除法可以用掩码替换并与零比较：

if (((i & 15) == 0) && (i % 14549535 == 0)) break;

第二个除法可以用模逆的乘法替换：

final long LCM = 14549535;
final long INV_LCM = 8384559098224769503L; // == 14549535**-1 mod 2**64
final long MAX_QUOTIENT = Long.MAX_VALUE / LCM;
// ...
if (((i & 15) == 0) &&
    (0 <= (i>>4) * INV_LCM) &&
    ((i>>4) * INV_LCM < MAX_QUOTIENT)) {
    break;
}

这有点不太可能JIT 正在使用这个，但它是并不像您想象的那么牵强 - 一些 C 编译器以这种方式实现指针减法。

There are a few optimizations possible. Maybe the Java JIT is performing them and the CLR is not.

Optimization #1:

(x % a == 0) && (x % b == 0) && ... && (x % z == 0)

is equivalent to

(x % lcm(a, b, ... z) == 0)

So in your example the comparison chain could be replaced by

if (i % 232792560 == 0) break;

(but of course if you've already calculated the LCM, there's little point in running the program in the first place!)

Optimization #2:

This is also equivalent:

if (i % (14549535 * 16)) == 0 break;

if ((i % 16 == 0) && (i % 14549535 == 0)) break;

The first division can be replaced with a mask and compare against zero:

if (((i & 15) == 0) && (i % 14549535 == 0)) break;

The second division can be replaced by a multiplication by the modular inverse:

final long LCM = 14549535;
final long INV_LCM = 8384559098224769503L; // == 14549535**-1 mod 2**64
final long MAX_QUOTIENT = Long.MAX_VALUE / LCM;
// ...
if (((i & 15) == 0) &&
    (0 <= (i>>4) * INV_LCM) &&
    ((i>>4) * INV_LCM < MAX_QUOTIENT)) {
    break;
}

It is somewhat unlikely that the JIT is employing this, but it is not as far-fetched as you might think - some C compilers implement pointer subtraction this way.

回复收藏 0 原文

恋你朝朝暮暮 2024-11-13 05:27:49

让两者变得更接近的关键是确保比较的公平性。

首先，确保与运行调试构建相关的成本，像您一样加载 pdb 符号。

接下来，您需要确保没有计算初始成本。显然，这些是实际成本，对某些人来说可能很重要，但在这种情况下，我们对循环本身感兴趣。

接下来您需要处理平台特定的行为。如果您使用的是 64 位 Windows 计算机，您可能会以 32 位或 64 位模式运行。在 64 位模式下，JIT 在许多方面都不同，通常会大大改变生成的代码。具体来说，我猜测您可以访问两倍的通用寄存器。

在这种情况下，循环的内部部分，当天真地翻译成机器代码时，需要将模测试中使用的常量加载到寄存器中。如果不足以容纳循环中所需的所有内容，则必须将它们从内存中推入。即使来自一级缓存，与将其全部保存在寄存器中相比，这也会是一个重大打击。

在 VS 2010 MS 将默认目标从anycpu更改为x86。我没有像 MSFT 那样的资源或面向客户的知识，所以我不会尝试再次猜测。然而，任何关注您正在做的性能分析之类的事情的人都应该尝试这两种方法。

一旦这些差异得到消除，这些数字似乎就更加合理了。任何进一步的差异可能需要比有根据的猜测更好的方法，而是需要调查生成的机器代码中的实际差异。

我认为对于优化编译器来说，有几件事会很有趣。

finnw 已经提到过：
- lcm 选项很有趣，但我不认为编译器编写者会为此烦恼。
- 将除法简化为乘法和掩码。
  - 我对此了解不够，但其他人已经尝试过请注意，他们在最新的英特尔芯片上显着地调用了分隔符更好。
  - 也许您甚至可以使用 SSE2 安排一些复杂的事情。
  - 当然，模 16 运算已经可以转换为掩码或移位了。
- 编译器可以发现所有测试都没有副作用。
  - 它可以推测性地尝试同时评估其中的几个，在超标量处理器上，这可以使事情进展得更快，但在很大程度上取决于编译器布局与 OO 执行引擎交互的程度。
- 如果寄存器压力很紧张，您可以将常量实现为单个变量，在每个循环开始时设置，然后随着循环的进行而递增。

这些都是纯粹的猜测，应该被视为无意义的曲折。如果你想知道拆解它。

The key to making these two become closer is to ensure that the comparison is fair.

First of all ensuring that costs associated with running Debug builds, loading pdb symbols as you did.

Next you need to ensure that there are no init costs being counted. Obviously these are real costs, and may matter to some people, but in this instance we are interested in the loop itself.

Next you need to deal with the platform specific behaviour. If you are on a 64bit windows machine you may be running either in 32bit or 64bit mode. In 64bit mode the JIT is different in many respects, often altering the resulting code considerably. Specifically, and I would guess pertinently, you get access to twice as many general purpose registers.

In this case the inner section of the loop, when naively translated into machine code, would need to load into registers the constants used in the modulo tests. If there are insufficient to hold everything needed in the loop then it must push them in from memory. Even coming from level1 cache this would be a significant hit compared to keeping it all in registers.

In VS 2010 MS changed the default target from anycpu to x86. I have nothing like the resources or customer facing knowledge of MSFT so I won't try to second guess that. However anyone looking at anything like the performance analysis you are doing should certainly try both.

Once those disparities are ironed out the numbers seem far more reasonable. Any further differences likely require better than educated guesses, instead they would need investigation into the actual differences in the generated machine code.

There are several things about this I think would be interesting for an optimising compiler.

The ones finnw already mentioned:
- The lcm option interesting but I can't see a compiler writer bothering.
- the reduction of division to multiplication and masking.
  - I don't know enough about this, but other people have tried note that they call out the divider on the more recent intel chips significantly better.
  - Perhaps you could even arrange something complex, with SSE2.
  - Certainly the modulo 16 operation is ripe for conversion into a mask or shift.
- A compiler could spot that none of the tests have side effects.
  - it could speculatively try to evaluate several of them at once, on a super scalar processor this could pump things along quite a bit faster, but would depend heavily on how well the compilers layout interacted with the OO execution engine.
- If register pressure was tight you could implement the constants as a single variable, set at the start of each loop then increment as you go along.

These are all utter guesses, and should be viewed as the idle meanderings. If you want to know disassemble it.

回复收藏 0 原文

你另情深 2024-11-13 05:27:49

（从 OP 移出）

将目标从 x86 更改为 anycpu 已降低每次运行的平均执行时间从 282 毫秒降至 84 毫秒。也许我应该把它分成第二个线程？

更新：
感谢下面的 Femaref 指出了一些测试问题，事实上，在遵循他的建议后，时间变短了，这表明 VM 设置时间在 Java 中很重要，但在 C# 中可能不是。在 C# 中，调试符号非常重要。

我更新了代码，让每个循环运行 10,000 次，最后只输出平均毫秒数。我所做的唯一重大更改是 C# 版本，我切换到 [StopWatch 类][3] 以获得更高的分辨率。我坚持使用毫秒，因为它已经足够好了。

结果：
测试更改并不能解释为什么 Java（仍然）比 C# 快得多。 C# 性能更好，但这完全可以通过删除调试符号来解释。如果您阅读 [Mike Two][4] 和我对此 OP 所附评论的交流，您会发现仅通过从“调试”切换到“发布”，我在 C# 代码的五次运行中平均获得了约 280 毫秒。

数字：

未修改的 Java 代码的 10,000 次计数循环平均为 45 毫秒（从 55 毫秒下降）
使用 StopWatch 类的 C# 代码的 10,000 次计数循环平均为 282 毫秒（从 320 毫秒下降）

全部这使得差异无法解释。事实上，差距变得更糟。 Java 从大约 5.8 倍快到大约 6.2 倍。

回复收藏 0 原文

疧_╮線 2024-11-13 05:27:49

这项任务太短，无法安排适当的时间。您需要运行至少 1000 次，看看会发生什么。看起来像是从命令行运行它们，在这种情况下，您可能会比较两者的 JIT 编译器。尝试将两个按钮放在一个简单的 GUI 中，并让该按钮在返回经过的时间之前至少循环数百次。即使忽略 JIT 编译，操作系统调度程序的粒度也可能会导致时间错乱。

哦，由于 JIT...只计算按钮按下的第二个结果。 :)

回复收藏 0 原文