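How can I make this Java code run faster?
I am trying to benchmark how fast Java can do a simple task: read a huge file into memory and then perform some meaningless calculations on the data. All kinds of optimizations count, whether it's rewriting the code differently, using a different JVM, tricking the JIT ..
The input file is a list of 500 million pairs of 32-bit integers, separated by commas. Like this: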
44439,5023
33140,22257
...
This file takes up 5.5GB on my machine. The program can't use more than 8GB of RAM and may use only a single thread.
package speedracer;

import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class Main
{
    public static void main(String[] args)
    {
        int[] list = new int[1000000000];

        long start1 = System.nanoTime();
        parse(list);
        long end1 = System.nanoTime();
        System.out.println("Parsing took: " + (end1 - start1) / 1000000000.0);

        int rs = 0;
        long start2 = System.nanoTime();
        for (int k = 0; k < list.length; k++) {
            rs = calc(list[k++], list[k++], list[k++], list[k]);
        }
        long end2 = System.nanoTime();
        System.out.println(rs);
        System.out.println("Calculations took: " + (end2 - start2) / 1000000000.0);
    }

    public static int calc(final int a1, final int a2, final int b1, final int b2)
    {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }
        return c1;
    }

    public static void parse(int[] list)
    {
        FileChannel fc = null;
        int i = 0;
        MappedByteBuffer byteBuffer;
        try {
            fc = new FileInputStream("in.txt").getChannel();
            long size = fc.size();
            long allocated = 0;
            long allocate = 0;
            // Map the file in chunks of at most Integer.MAX_VALUE bytes.
            while (size > allocated) {
                if ((size - allocated) > Integer.MAX_VALUE) {
                    allocate = Integer.MAX_VALUE;
                } else {
                    allocate = size - allocated;
                }
                byteBuffer = fc.map(FileChannel.MapMode.READ_ONLY, allocated, allocate);
                byteBuffer.clear();
                allocated += allocate;
                int number = 0;
                while (byteBuffer.hasRemaining()) {
                    char val = (char) byteBuffer.get();
                    if (val == '\n' || val == ',') {
                        list[i] = number;
                        number = 0;
                        i++;
                    } else {
                        number = number * 10 + (val - '0');
                    }
                }
            }
            fc.close();
        } catch (Exception e) {
            System.err.println("Parsing error: " + e);
        }
    }
}
I've tried everything I could think of: different readers, and openjdk6, sunjdk6 and sunjdk7. I had to do some ugly parsing because MappedByteBuffer cannot map more than 2GB of memory at once. I'm running:
Linux AS292 2.6.38-11-generic #48-Ubuntu SMP
Fri Jul 29 19:02:55 UTC 2011
x86_64 GNU/Linux. Ubuntu 11.04.
CPU: Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz.
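Currently, my results are parsing: 26.50s, calculations: 11.27s. I'm competing against a similar C++ benchmark that does the I/O in roughly the same time, but whose calculations take only 4.5s. My main objective is to reduce the calculation time by any means possible. Any ideas?
Update: It seems the main speed improvement could come from what is called auto-vectorization. I was able to find some hints that Sun's current JIT only does "some vectorization", but I can't really confirm it. It would be great to find a JVM or JIT with better support for auto-vectorization.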
Comments (7)
First of all, -O3 enables, among others, automatic loop vectorization (-ftree-vectorize in GCC).
So it looks like it actually might be vectorizing.
EDIT:
This has been confirmed (see comments). The C++ version is indeed being vectorized by the compiler. With vectorization disabled, the C++ version actually runs a bit slower than the Java version.
Assuming the JIT does not vectorize the loop, it may be difficult/impossible for the Java version to match the speed of the C++ version.
Now, if I were a smart C/C++ compiler, I would rearrange that loop (on x64) so that it is completely vectorizable.
Even better, I would completely unroll this loop. These are things that a C/C++ compiler will do. But now the question is: will the JIT do it?
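A minimal sketch of the kind of rearrangement this answer describes, written in Java for comparison with the question's code. It is not the answer's own listing: the class name SplitAccumulatorSketch, the method calcSplit and the choice of four accumulators are assumptions made for illustration. The idea is that XOR is associative and commutative, so the 100 iterations can be split into four independent partial results that are combined at the end, which is the shape a vectorizing compiler needs.

```java
final class SplitAccumulatorSketch {

    // Reference version, copied from calc() in the question.
    public static int calc(int a1, int a2, int b1, int b2) {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }
        return c1;
    }

    // Hypothetical rearrangement: four independent XOR accumulators.
    // Relies on the trip count (100) being a multiple of 4.
    public static int calcSplit(int a1, int a2, int b1, int b2) {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        int t0 = c1, t1 = 0, t2 = 0, t3 = 0;
        for (int z = 0; z < 100; z += 4) {
            t0 ^= z + c2;
            t1 ^= (z + 1) + c2;
            t2 ^= (z + 2) + c2;
            t3 ^= (z + 3) + c2;
        }
        // Combine the partial results; order does not matter for XOR.
        return (t0 ^ t1) ^ (t2 ^ t3);
    }

    public static void main(String[] args) {
        System.out.println(calc(44439, 5023, 33140, 22257)
                == calcSplit(44439, 5023, 33140, 22257));
    }
}
```

Whether a given JIT turns the four accumulators into SIMD instructions is a separate question; the sketch only shows that the transformation preserves the result.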
Use the HotSpot JVM in server mode, and make sure to warm it up. Also give the garbage collection algorithms enough time to settle down to a stable pace if collection is a major part of your test. I don't see anything at a glance that makes me think it would be...
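A rough sketch of what warming up could look like for the calculation phase, assuming the program is started with the HotSpot -server option. The class name WarmupSketch, the helper run, the array size and the number of warm-up rounds are all illustrative assumptions, not values from the question.

```java
final class WarmupSketch {

    // Same calculation as calc() in the question.
    public static int calc(int a1, int a2, int b1, int b2) {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }
        return c1;
    }

    // One full pass over the data, four ints per calc() call.
    public static int run(int[] list) {
        int rs = 0;
        for (int k = 0; k + 3 < list.length; k += 4) {
            rs = calc(list[k], list[k + 1], list[k + 2], list[k + 3]);
        }
        return rs;
    }

    public static void main(String[] args) {
        int[] list = new int[1000000];          // small stand-in for the real data
        for (int i = 0; i < list.length; i++) {
            list[i] = i * 31 + 7;
        }

        // Warm-up: run the hot code several times so the JIT can compile it
        // before the measured pass. The result is kept so it is not discarded.
        int sink = 0;
        for (int round = 0; round < 10; round++) {
            sink ^= run(list);
        }

        long start = System.nanoTime();
        int rs = run(list);
        long end = System.nanoTime();
        System.out.println(rs + " " + sink);
        System.out.println("Calculations took: " + (end - start) / 1000000000.0);
    }
}
```

In the question's program the calculation loop sits directly in main() and runs only once, so moving it into its own method, as run does here, is what makes repeated warm-up passes possible.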
Interesting question. :-) This is probably more of a comment since I won't really answer your question, but it's too long for the comment box.
Micro-benchmarking in Java is tricky because the JIT can go nuts with optimizations. But this particular code tricks the JIT in such a way that it somehow cannot perform its normal optimizations.
Normally, this code would run in O(1) time because your main loop has no effect on anything:
    for (int k = 0; k < list.length; k++) {
        rs = calc(list[k++], list[k++], list[k++], list[k]);
    }
Note that the final result of rs doesn't really depend on running all iterations of the loop, just the last one. You can calculate the final value of "k" for the loop without having to actually run the loop. Normally the JIT would notice that and turn your loop into a single assignment, if it's able to detect that the function being called (calc) has no side effects (which it doesn't).
But, somehow, this statement in the calc() function messes up the JIT:
    c1 ^= z + c2;
Somehow that adds too much complexity for the JIT to decide that all this code in the end doesn't change anything and that the original loop can be optimized out.
If you change that particular statement to something even more pointless, like:
    c1 = z + c2;
Then the JIT picks things up and optimizes your loops away. Try it out. :-)
I tried locally with a much smaller data set and with the "^=" version calculations took ~1.6s, while with the "=" version they took 0.007 seconds (or, in other words, it optimized away the loop).
As I said, not really a response, but I thought this might be interesting.
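The observation that rs depends only on the last iteration can be checked with a small sketch. The class LastIterationSketch and the methods fullLoop and lastOnly are hypothetical names, and the sketch assumes the array length is a positive multiple of 4, as it is in the question.

```java
final class LastIterationSketch {

    // Same calculation as calc() in the question.
    public static int calc(int a1, int a2, int b1, int b2) {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }
        return c1;
    }

    // The loop from the question: rs is overwritten on every iteration.
    public static int fullLoop(int[] list) {
        int rs = 0;
        for (int k = 0; k < list.length; k++) {
            rs = calc(list[k++], list[k++], list[k++], list[k]);
        }
        return rs;
    }

    // What an optimizer could reduce it to: only the last group of four matters.
    public static int lastOnly(int[] list) {
        int k = list.length - 4;
        return calc(list[k], list[k + 1], list[k + 2], list[k + 3]);
    }

    public static void main(String[] args) {
        int[] list = new int[4000];
        for (int i = 0; i < list.length; i++) {
            list[i] = i * 31 + 7;
        }
        System.out.println(fullLoop(list) == lastOnly(list));
    }
}
```

This is also why the benchmark measures less than it appears to: a sufficiently aggressive compiler is free to skip all but the final call.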
Did you try "inlining" parse() and calc(), i.e. put all the code in main()?
What is the score if you move the few lines of your calc function inside of your list iteration?
I know it's not very clean, but you'll save the overhead of the method calls.
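A minimal sketch of the manual inlining suggested here, with the body of calc() moved into the list iteration. The names InlinedCalcSketch, withCall and inlined are made up for illustration. HotSpot typically inlines small static methods like calc() on its own, so any gain is an assumption to be measured rather than a given.

```java
final class InlinedCalcSketch {

    // Same calculation as calc() in the question.
    public static int calc(int a1, int a2, int b1, int b2) {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }
        return c1;
    }

    // Baseline: one method call per group of four ints.
    public static int withCall(int[] list) {
        int rs = 0;
        for (int k = 0; k + 3 < list.length; k += 4) {
            rs = calc(list[k], list[k + 1], list[k + 2], list[k + 3]);
        }
        return rs;
    }

    // Hand-inlined variant: the body of calc() sits inside the loop.
    public static int inlined(int[] list) {
        int rs = 0;
        for (int k = 0; k + 3 < list.length; k += 4) {
            int c1 = (list[k] + list[k + 1]) ^ list[k + 1];
            int c2 = (list[k + 2] - list[k + 3]) << 4;
            for (int z = 0; z < 100; z++) {
                c1 ^= z + c2;
            }
            rs = c1;
        }
        return rs;
    }

    public static void main(String[] args) {
        int[] list = {44439, 5023, 33140, 22257};
        System.out.println(withCall(list) == inlined(list));
    }
}
```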
The MappedByteBuffer is only contributing about 20% in I/O performance and it is an enormous memory cost - if it causes swapping the cure is worse than the disease.
I would use a BufferedReader around a FileReader, and maybe a Scanner around that to get the integers, or at least Integer.parseInt(), which is a lot more likely to have been warmed up by HotSpot than your own radix conversion code.
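A sketch of the BufferedReader plus Integer.parseInt() approach suggested above. The class ReaderParseSketch, the parse signature taking a Reader, and the 64 KB buffer size are assumptions for illustration. It expects one "a,b" pair per line with no surrounding whitespace, as in the question's sample. For the real input the caller would pass new FileReader("in.txt").

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ReaderParseSketch {

    // Reads "a,b" lines from source into list and returns how many ints were stored.
    public static int parse(Reader source, int[] list) {
        int i = 0;
        try (BufferedReader in = new BufferedReader(source, 1 << 16)) {
            String line;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) {
                    continue; // skip blank or malformed lines
                }
                list[i++] = Integer.parseInt(line.substring(0, comma));
                list[i++] = Integer.parseInt(line.substring(comma + 1));
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return i;
    }

    public static void main(String[] args) {
        int[] list = new int[4];
        int count = parse(new StringReader("44439,5023\n33140,22257\n"), list);
        System.out.println(count + " ints parsed, first = " + list[0]);
    }
}
```

This version allocates a String per line, so whether it beats the hand-written digit loop in the question would have to be measured on the actual 5.5GB file.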
If the task is to do a meaningless calculation, then the best optimization is to not do the calculation.
If what you are really trying to do here is to figure out if there is a general technique to make a computation go faster, then I think you are barking up the wrong tree. There is no such technique. What you learn from optimizing a meaningless calculation is not likely to apply to other (hopefully meaningful) calculations.
If the calculation is not meaningless, and the aim is to make the whole program go faster, you've probably already reached the point where optimization is a waste of time.
A speedup of less than 20% for a ~40 second computation is probably not worth the effort. It is cheaper to get the user to twiddle his thumbs for those extra 7 seconds.
This is also telling you something interesting: in this scenario, it doesn't make much difference in relative terms whether you use C++ or Java. The overall performance of the program is dominated by a phase in which C++ and Java are comparable.
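The "less than 20%" and "7 seconds" figures can be reproduced from the timings in the question. The sketch below assumes the parsing time stays at 26.50s and that the Java calculation could at best match the 4.5s of the C++ version. The class SpeedupEstimate and its method are hypothetical names.

```java
final class SpeedupEstimate {

    // Fraction of the total run time saved if only the calculation phase gets faster.
    public static double savedFraction(double parseSeconds,
                                       double calcNowSeconds,
                                       double calcTargetSeconds) {
        double totalNow = parseSeconds + calcNowSeconds;
        double totalTarget = parseSeconds + calcTargetSeconds;
        return (totalNow - totalTarget) / totalNow;
    }

    public static void main(String[] args) {
        double parse = 26.50;     // parsing time from the question
        double calcJava = 11.27;  // current Java calculation time
        double calcCpp = 4.5;     // C++ calculation time, taken as the best case
        System.out.printf("saved: %.2f s (%.1f%% of the total)%n",
                calcJava - calcCpp,
                100.0 * savedFraction(parse, calcJava, calcCpp));
    }
}
```

With these numbers the saving is about 6.8 seconds out of roughly 37.8, or about 18% of the total run time.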