How can I speed up this Java code?

Posted 2024-12-05 02:09:47

I am trying to benchmark how fast Java can do a simple task: read a huge file into memory and then perform some meaningless calculations on the data. All types of optimizations count, whether that means rewriting the code differently, using a different JVM, or tricking the JIT...

The input file is a list of 500 million pairs of 32-bit integers, each pair separated by a comma. Like this:

44439,5023
33140,22257
...

This file takes 5.5GB on my machine. The program can't use more than 8GB of RAM and can use only a single thread.

package speedracer;

import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class Main
{
    public static void main(String[] args)
    {
        int[] list = new int[1000000000];

        long start1 = System.nanoTime();
        parse(list);
        long end1 = System.nanoTime();

        System.out.println("Parsing took: " + (end1 - start1) / 1000000000.0);

        int rs = 0;
        long start2 = System.nanoTime();

        for (int k = 0; k < list.length; k++) {
            rs = calc(list[k++], list[k++], list[k++], list[k]);
        }

        long end2 = System.nanoTime();

        System.out.println(rs);
        System.out.println("Calculations took: " + (end2 - start2) / 1000000000.0);
    }

    public static int calc(final int a1, final int a2, final int b1, final int b2)
    {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;

        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }

        return c1;
    }

    public static void parse(int[] list)
    {
        FileChannel fc = null;
        int i = 0;

        MappedByteBuffer byteBuffer;

        try {
            fc = new FileInputStream("in.txt").getChannel();

            long size = fc.size();
            long allocated = 0;
            long allocate = 0;

            while (size > allocated) {

               if ((size - allocated) > Integer.MAX_VALUE) {
                   allocate = Integer.MAX_VALUE;
               } else {
                   allocate = size - allocated;
               }

               byteBuffer = fc.map(FileChannel.MapMode.READ_ONLY, allocated, allocate);
               byteBuffer.clear();

               allocated += allocate;

               int number = 0;

               while (byteBuffer.hasRemaining()) {
                   char val = (char) byteBuffer.get();
                   if (val == '\n' || val == ',') {
                        list[i] = number;

                        number = 0;
                        i++;
                   } else {
                       number = number * 10 + (val - '0');
                   }
                }
            }

            fc.close();

        } catch (Exception e) {
            System.err.println("Parsing error: " + e);
        }
    }
}

I've tried everything I could think of: different readers and different JVMs (openjdk6, sunjdk6, sunjdk7). I had to do some ugly parsing because MappedByteBuffer cannot map more than 2GB of memory at once. I'm running:

   Linux AS292 2.6.38-11-generic #48-Ubuntu SMP 
   Fri Jul 29 19:02:55 UTC 2011 
   x86_64 GNU/Linux. Ubuntu 11.04. 
   CPU: is Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz.
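
A side note on the chunked mapping: in parse() above, number is reset to 0 each time a new region is mapped, so a value that happens to straddle a region boundary would lose its leading digits. A minimal sketch of one way around that, assuming the running value is simply kept outside the chunk loop. The class name, the byte[] input and the chunkSize parameter are hypothetical stand-ins for the mapped file regions, chosen only so the sketch runs without a 5.5GB file:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not the original parse(): same byte-by-byte parsing,
// but the partially built number survives across chunk boundaries.
class ChunkedParseSketch {
    // Parses comma/newline separated non-negative integers from data,
    // reading it in chunks of chunkSize bytes. Returns how many values were stored.
    static int parse(byte[] data, int chunkSize, int[] out) {
        int i = 0;
        int number = 0;           // declared outside the chunk loop on purpose
        boolean inNumber = false; // true while digits of a value are pending
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int length = Math.min(chunkSize, data.length - offset);
            ByteBuffer chunk = ByteBuffer.wrap(data, offset, length);
            while (chunk.hasRemaining()) {
                byte val = chunk.get();
                if (val == '\n' || val == ',') {
                    out[i++] = number;
                    number = 0;
                    inNumber = false;
                } else {
                    number = number * 10 + (val - '0');
                    inNumber = true;
                }
            }
        }
        if (inNumber) {
            out[i++] = number; // last value when the input has no trailing newline
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] data = "44439,5023\n33140,22257\n".getBytes(StandardCharsets.US_ASCII);
        int[] out = new int[4];
        int n = parse(data, 3, out); // tiny chunks force values to straddle boundaries
        System.out.println(n + " values: " + Arrays.toString(out));
    }
}
```

This does not affect the timings, only the correctness of the few values that sit on a boundary.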

Currently, my results are: parsing 26.50s, calculations 11.27s. I'm competing against a similar C++ benchmark which does the IO in roughly the same time, but its calculations take only 4.5s. My main objective is to reduce the calculation time by any means possible. Any ideas?

Update: It seems the main speed improvement could come from what is called auto-vectorization. I was able to find some hints that Sun's current JIT only does "some vectorization", but I can't really confirm it. It would be great to find a JVM or JIT with better auto-vectorization support.

薄暮涼年 2024-12-12 02:09:47

First of all, -O3 enables:

-finline-functions
-ftree-vectorize

among others...

So it looks like it actually might be vectorizing.

EDIT :
This has been confirmed (see comments). The C++ version is indeed being vectorized by the compiler. With vectorization disabled, the C++ version actually runs a bit slower than the Java version.

Assuming the JIT does not vectorize the loop, it may be difficult/impossible for the Java version to match the speed of the C++ version.

Now, if I were a smart C/C++ compiler, here's how I would arrange that loop (on x64):

int c1 = (a1 + a2) ^ a2;
int c2 = (b1 - b2) << 4;

int tmp0 = c1;
int tmp1 = 0;
int tmp2 = 0;
int tmp3 = 0;

int z0 = 0;
int z1 = 1;
int z2 = 2;
int z3 = 3;

do{
    tmp0 ^= z0 + c2;
    tmp1 ^= z1 + c2;
    tmp2 ^= z2 + c2;
    tmp3 ^= z3 + c2;
    z0 += 4;
    z1 += 4;
    z2 += 4;
    z3 += 4;
}while (z0 < 100);

tmp0 ^= tmp1;
tmp2 ^= tmp3;

tmp0 ^= tmp2;

return tmp0;

Note that this loop is completely vectorizable.
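
To convince yourself that this rearrangement is legal, here is a small Java check, offered as a sketch rather than as anything a compiler actually emits. It relies only on XOR being associative and commutative, so the 100 terms can be split across four independent accumulators and combined at the end. The class and method names are made up for the example:

```java
import java.util.Random;

// Sketch: compares the original calc() with a hand-unrolled,
// four-accumulator variant on random inputs.
class UnrolledCalcCheck {
    // Original calc() from the question.
    static int calc(int a1, int a2, int b1, int b2) {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }
        return c1;
    }

    // Hypothetical unrolled variant: each accumulator takes every fourth term.
    static int calcUnrolled(int a1, int a2, int b1, int b2) {
        int c2 = (b1 - b2) << 4;
        int t0 = (a1 + a2) ^ a2;
        int t1 = 0, t2 = 0, t3 = 0;
        for (int z = 0; z < 100; z += 4) {
            t0 ^= z + c2;
            t1 ^= z + 1 + c2;
            t2 ^= z + 2 + c2;
            t3 ^= z + 3 + c2;
        }
        return (t0 ^ t1) ^ (t2 ^ t3);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int a1 = rnd.nextInt(), a2 = rnd.nextInt();
            int b1 = rnd.nextInt(), b2 = rnd.nextInt();
            if (calc(a1, a2, b1, b2) != calcUnrolled(a1, a2, b1, b2)) {
                throw new AssertionError("mismatch for " + a1 + "," + a2 + "," + b1 + "," + b2);
            }
        }
        System.out.println("unrolled variant matches calc()");
    }
}
```

Whether writing the loop this way by hand helps HotSpot is a separate question that only a measurement can answer; the check above only shows the two forms compute the same value.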

Even better, I would completely unroll this loop. These are things that a C/C++ compiler will do. But now the question is: will the JIT do it?

柏林苍穹下 2024-12-12 02:09:47

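Use the HotSpot JVM in server mode, and make sure to warm it up. If garbage collection is a major part of your test, also give the collector enough time to settle to a steady pace. At first glance I don't see anything that makes me think it would be......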
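
As a rough illustration of the warm-up advice, here is a minimal sketch that runs the calculation loop a few times untimed before measuring it. The number of warm-up passes (5) and the array size are arbitrary assumptions, and the class name is made up; calc() is copied from the question:

```java
import java.util.Random;

// Sketch of a warmed-up measurement; pass counts and sizes are assumptions.
class WarmupSketch {
    static int calc(int a1, int a2, int b1, int b2) {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }
        return c1;
    }

    // Same traversal as the main loop in the question.
    static int run(int[] list) {
        int rs = 0;
        for (int k = 0; k < list.length; k++) {
            rs = calc(list[k++], list[k++], list[k++], list[k]);
        }
        return rs;
    }

    public static void main(String[] args) {
        int[] list = new int[4_000_000]; // much smaller than the real data set
        Random rnd = new Random(1);
        for (int i = 0; i < list.length; i++) {
            list[i] = rnd.nextInt(65536);
        }

        // Untimed passes so the JIT has a chance to compile the hot loop.
        int sink = 0;
        for (int i = 0; i < 5; i++) {
            sink ^= run(list);
        }

        long start = System.nanoTime();
        int rs = run(list);
        long end = System.nanoTime();

        System.out.println(rs + " (warm-up sink: " + sink + ")");
        System.out.println("Timed pass took: " + (end - start) / 1000000000.0);
    }
}
```

Server mode is selected with the -server option of the java launcher, for example java -server WarmupSketch.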

柏林苍穹下 2024-12-12 02:09:47

Interesting question. :-) This is probably more of a comment since I won't really answer your question, but it's too long for the comment box.

Micro-benchmarking in Java is tricky because the JIT can go nuts with optimizations. But this particular code tricks the JIT in such a way that it somehow cannot perform its normal optimizations.

Normally, this code would run in O(1) time because your main loop has no effect on anything:

    for (int k = 0; k < list.length; k++) {
        rs = calc(list[k++], list[k++], list[k++], list[k]);
    }

Note that the final result of rs doesn't really depend on running all iterations of the loop, just the last one. You can calculate the final value of "k" for the loop without having to actually run the loop. Normally the JIT would notice that and turn your loop into a single assignment, if it's able to detect that the function being called (calc) has no side effects (which it doesn't).
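
To make that concrete, here is a small sketch showing that the whole loop yields the same rs as a single calc() call on the last four elements. It assumes the list length is a non-zero multiple of 4, as it is in the question; the class and method names are made up for the example:

```java
import java.util.Random;

// Sketch: the main loop's result equals calc() applied to the last four elements.
class LastIterationOnly {
    static int calc(int a1, int a2, int b1, int b2) {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }
        return c1;
    }

    // The loop exactly as written in the question.
    static int fullLoop(int[] list) {
        int rs = 0;
        for (int k = 0; k < list.length; k++) {
            rs = calc(list[k++], list[k++], list[k++], list[k]);
        }
        return rs;
    }

    // What the loop boils down to; assumes list.length is a non-zero multiple of 4.
    static int lastOnly(int[] list) {
        int n = list.length;
        return calc(list[n - 4], list[n - 3], list[n - 2], list[n - 1]);
    }

    public static void main(String[] args) {
        int[] list = new int[4000];
        Random rnd = new Random(7);
        for (int i = 0; i < list.length; i++) {
            list[i] = rnd.nextInt();
        }
        System.out.println(fullLoop(list) == lastOnly(list));
    }
}
```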

But, somehow, this statement in the calc() function messes up the JIT:

        c1 ^= z + c2;

Somehow that adds too much complexity for the JIT to decide that all this code in the end doesn't change anything and that the original loop can be optimized out.

If you change that particular statement to something even more pointless, like:

        c1 = z + c2;

Then the JIT picks things up and optimizes your loops away. Try it out. :-)
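
One way to see why the "=" version is so much easier to eliminate: every iteration overwrites c1, so only the last one (z == 99) is observable and the inner loop collapses to a constant-time expression. The sketch below just checks that equivalence; whether the JIT reasons in exactly this way is an assumption on my part, and the names are made up:

```java
import java.util.Random;

// Sketch: with "=" instead of "^=", calc() reduces to a closed form.
class AssignVariantSketch {
    // calc() with the "^=" replaced by "=".
    static int calcAssign(int a1, int a2, int b1, int b2) {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) {
            c1 = z + c2; // overwrites c1 on every iteration
        }
        return c1;
    }

    // Only the final iteration matters, so a1 and a2 drop out entirely.
    static int closedForm(int b1, int b2) {
        return 99 + ((b1 - b2) << 4);
    }

    public static void main(String[] args) {
        Random rnd = new Random(11);
        for (int i = 0; i < 100_000; i++) {
            int a1 = rnd.nextInt(), a2 = rnd.nextInt();
            int b1 = rnd.nextInt(), b2 = rnd.nextInt();
            if (calcAssign(a1, a2, b1, b2) != closedForm(b1, b2)) {
                throw new AssertionError("mismatch");
            }
        }
        System.out.println("assignment variant equals 99 + c2");
    }
}
```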

I tried locally with a much smaller data set and with the "^=" version calculations took ~1.6s, while with the "=" version they took 0.007 seconds (or, in other words, it optimized away the loop).

As I said, not really a response, but I thought this might be interesting.

絕版丫頭 2024-12-12 02:09:47

Have you tried "inlining" parse() and calc(), i.e. putting all the code in main()?

櫻之舞 2024-12-12 02:09:47

What is the score if you move the few lines of your calc function inside your list iteration?
I know it's not very clean, but you'll save the overhead of the method call.

[...]
    for (int k = 0; k < list.length; k++) {
        int a1 = list[k++];
        int a2 = list[k++];
        int b1 = list[k++];
        int b2 = list[k];

        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;

        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }

        rs = c1;
    }
