第一次 Java 循环运行缓慢,为什么? [Sun HotSpot 1.5,sparc]
在对 Solaris SPARC 机器上的一些 Java 代码进行基准测试时,我注意到第一次调用基准测试函数时,它运行得非常慢(相差 10 倍)
- : 1 | 25295.979 毫秒
- 第二 | 1 | 2256.990 毫秒
- 第三 | 1 | 2250.575 ms
这是为什么? 我怀疑JIT编译器,有什么方法可以验证这一点吗?
编辑:根据一些答案,我想澄清一下这段代码是最简单的 我可以找到表现出这种行为的可能测试用例。 所以我的目标不是获得 它跑得快,但要了解发生了什么,这样我就可以在现实中避免它 基准。
已解决: Tom Hawtin 正确地指出我的“慢”时间实际上是合理的。 根据这一观察,我将调试器附加到 Java 进程。 在第一次期间,内部循环如下所示:
0xf9037218: cmp %l0, 100
0xf903721c: bge,pn %icc,0xf90371f4 ! 0xf90371f4
0xf9037220: nop
0xf9037224: ld [%l3 + 92], %l2
0xf9037228: ld [%l2 + 8], %l6
0xf903722c: add %l6, 1, %l5
0xf9037230: st %l5, [%l2 + 8]
0xf9037234: inc %l0
0xf9037238: ld [%l1], %g0
0xf903723c: ba,pt %icc,0xf9037218 ! 0xf9037218
在接下来的迭代中,循环如下所示:
0xf90377d4: sub %l2, %l0, %l3
0xf90377d8: add %l3, %l0, %l2
0xf90377dc: add %l2, 1, %l4
0xf90377e0: inc %l0
0xf90377e4: cmp %l0, 100
0xf90377e8: bl,pn %icc,0xf90377d8 ! 0xf90377d8
因此 HotSpot 从内部循环中删除了内存访问,将其速度提高了一个数量级。
课程:做数学题! 我应该自己做汤姆的计算。
基准 Java 代码:
private int counter;
private int nThreads;
private void measure(String tag) throws Exception {
MyThread threads[] = new MyThread[nThreads];
int i;
counter = 0;
for (i = 0; i < nThreads; i++)
threads[i] = new MyThread();
long start = System.nanoTime();
for (i = 0; i < nThreads; i++)
threads[i].start();
for (i = 0; i < nThreads; i++)
threads[i].join();
if (tag != null)
System.out.format("%-20s | %-2d | %.3f ms \n", tag, nThreads,
new Double((System.nanoTime() - start) / 1000000.0));
}
public MyBench() {
try {
this.nThreads = 1;
measure("First");
measure("Second");
measure("Third");
} catch (Exception e) {
System.out.println("Error: " + e);
}
}
private class MyThread extends Thread {
public void run() {
while (counter < 10000000) {
// work
for (int j = 0; j < 100; j++)
counter++;
counter -= 99;
}
}
}
In benchmarking some Java code on a Solaris SPARC box, I noticed that the first time I call the benchmarked function it runs EXTREMELY slowly (10x difference):
- First | 1 | 25295.979 ms
- Second | 1 | 2256.990 ms
- Third | 1 | 2250.575 ms
Why is this? I suspect the JIT compiler, is there any way to verify this?
Edit: In light of some answers I wanted to clarify that this code is the simplest
possible test-case I could find exhibiting this behavior. So my goal isn't to get
it to run fast, but to understand what's going on so I can avoid it in my real
benchmarks.
Solved: Tom Hawtin correctly pointed out that my "SLOW" time was actually reasonable.
Following this observation, I attached a debugger to the Java process. During the first, the inner loop looks like this:
0xf9037218: cmp %l0, 100
0xf903721c: bge,pn %icc,0xf90371f4 ! 0xf90371f4
0xf9037220: nop
0xf9037224: ld [%l3 + 92], %l2
0xf9037228: ld [%l2 + 8], %l6
0xf903722c: add %l6, 1, %l5
0xf9037230: st %l5, [%l2 + 8]
0xf9037234: inc %l0
0xf9037238: ld [%l1], %g0
0xf903723c: ba,pt %icc,0xf9037218 ! 0xf9037218
On the following iterations, the loop looks like this:
0xf90377d4: sub %l2, %l0, %l3
0xf90377d8: add %l3, %l0, %l2
0xf90377dc: add %l2, 1, %l4
0xf90377e0: inc %l0
0xf90377e4: cmp %l0, 100
0xf90377e8: bl,pn %icc,0xf90377d8 ! 0xf90377d8
So HotSpot removed memory accesses from the inner loop, speeding it up by an order of magnitude.
Lesson: Do the math! I should have done Tom's calculation myself.
Benchmark Java code:
private int counter;
private int nThreads;
private void measure(String tag) throws Exception {
MyThread threads[] = new MyThread[nThreads];
int i;
counter = 0;
for (i = 0; i < nThreads; i++)
threads[i] = new MyThread();
long start = System.nanoTime();
for (i = 0; i < nThreads; i++)
threads[i].start();
for (i = 0; i < nThreads; i++)
threads[i].join();
if (tag != null)
System.out.format("%-20s | %-2d | %.3f ms \n", tag, nThreads,
new Double((System.nanoTime() - start) / 1000000.0));
}
public MyBench() {
try {
this.nThreads = 1;
measure("First");
measure("Second");
measure("Third");
} catch (Exception e) {
System.out.println("Error: " + e);
}
}
private class MyThread extends Thread {
public void run() {
while (counter < 10000000) {
// work
for (int j = 0; j < 100; j++)
counter++;
counter -= 99;
}
}
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(11)
一些丑陋的、不切实际的代码(微基准测试的东西):
那么它在做什么以及它应该运行多快。
内部循环将 counter 递增 100 次,然后 counter 递减 99。因此递增 1。注意 counter 是外部类的成员变量,因此存在一些开销。 然后运行 10,000,000 次。 因此内循环运行了 1,000,000,000 次。
用于访问方法的循环,称为 25 个循环。 1 GHz 下 1,000,000,000 次,需要 25 秒。
嘿,我们预测了慢时间。 慢的时光却很快。 快速时间是在基准以某种方式被打破之后 - 一次迭代 2.5 个周期? 使用 -server 你可能会发现它变得更加愚蠢。
Some ugly, unrealistic code (the stuff of microbenchmarks):
So what is this doing and how fast should it run.
The inner loop increments counter 100 times, then the counter is decremented by 99. So an increment of 1. Note counter is a member variable of an outer class, so some overhead there. This is then run 10,000,000 times. So the inner loop is run 1,000,000,000 times.
A loop using to accessor methods, call it 25 cycles. 1,000,000,000 times at 1 GHz, gives 25s.
Hey, we predicted the SLOW time. The slow time is fast. The fast times are after the benchmark has been broken in some way - 2.5 cycles an iteration? Use -server and you might find it gets even more silly.
可能是类加载或本地方法的动态链接。 如果您使用以下 JVM 参数运行 Java(请参阅此处了解完整信息list),它将打印有关花费时间的信息:
-verbose:class -verbose:jni -verbose:gc -XX:+PrintCompilation
要准确找出每个measure()调用的位置start 和 end,在这些方法之间添加一些新类的初始化作为标记,以便
-verbose:class
将在日志中显示标记类加载的时间点。 请参阅此答案用于类似的测量。为了弄清楚你的代码到底做了什么,我修改了它,如下所示:
然后通过查看 JVM 何时加载那些 Mark1 等类,以下是结果。
第一次调用measure()时,总共加载了85个类,动态链接了11个本机方法,并且对5个方法进行了JIT编译:
第二次调用时,只有一个方法进行了JIT编译:
第三次调用时,没有正在进行的额外工作:
这是在使用 jdk1.6.0_11 的 Windows 上运行的,因此您的系统可能会有所不同。 例如,也许这些动态方法的链接之一在您的系统上异常缓慢。 否则所有的类加载都会变慢。 尝试查看日志,是否有异常长时间的暂停,或者所有这些操作是否同样缓慢。
Probably it's class loading or dynamic linking of native methods. If you run Java with the following JVM parameters (see here for full list), it will print information about what is taking the time:
-verbose:class -verbose:jni -verbose:gc -XX:+PrintCompilation
To find out exactly where each of the measure() calls start and end, add initializations of some new classes between those methods as markers, so that
-verbose:class
will show at what point in the logs the marker class is loaded. See this answer for a similar measurement.To find out exactly what your code does, I modified it like this:
Then by looking at when the JVM loaded those Mark1 etc. classes, here are the results.
During the first call to measure(), a total of 85 classes were loaded, 11 native methods were dynamically linked and 5 methods were JIT compiled:
During the second call, only one method was JIT compiled:
During the third call, there was no additional work happening:
This was run on Windows with jdk1.6.0_11, so your system might do things a little bit differently. For example, maybe one of those linkings of dynamic methods is exceptionally slow on your system. Or then all of the class loading is slower. Try to have a look at the logs, whether there is an unusually long pause, or whether all of those operations are equally slow.
添加类加载作为嫌疑人。 类在第一次引用时延迟加载。 因此,代码第一次运行时,您可能是第一次引用某些类。
Add class loading in as a suspect. Classes are loaded lazily on first reference. So the first time the code runs, you're probably referencing some classes for the first time.
验证 JIT 编译器是否是后续迭代中加速的原因的最佳方法是在关闭 JIT 编译器的情况下运行基准测试。 为此,请指定系统属性
java.compiler=NONE
(单词“none”必须大写)。类加载所花费的时间也会导致基准测试代码第一次运行速度变慢。 最后,在调用 Thread.start() 和调用 Thread 的 run() 方法之间存在不确定性延迟。
您可能需要考虑寻找一个基准框架。 一个好的框架将通过运行多次迭代来“预热”代码,然后使用不同数量的迭代进行多次计时。 请参阅 Java 理论与实践:有缺陷的微基准剖析。
The best way to verify if the JIT compiler is the reason for the speedup in later iterations is to run the benchmark with the JIT compiler turned off. To do this, specify the system property
java.compiler=NONE
(the word "none" must be in upper case).Time spent doing class loading can also cause the benchmarked code to run slower the first time. Finally, there is a nondeterministic delay between calling Thread.start() and the Thread's run() method being called.
You might want to consider finding a benchmark framework. A good framework will "warm up" the code by running several iterations, then do multiple timings with a different number of iterations. See Java theory and practice: Anatomy of a flawed microbenchmark.
这是一个有趣的问题。 我怀疑 JIT 编译器,但这些是我的数字:
可能 Solaris 正在用线程做一些有趣的事情; 您是否尝试过 nThreads = 10 左右?
That's an interesting question. I'd suspect the JIT compiler, but these are my numbers:
Possibly Solaris is doing something funny with threads; have you tried with
nThreads = 10
or so?我建议您设置 nThread = Runtime.getRuntime().availableProcessors() 这将为您提供使用系统中所有核心的最佳线程数。
您可以尝试关闭 JIT,看看会有什么不同。
I suggest you make nThread = Runtime.getRuntime().availableProcessors() This will give you the optimal number of threads to use all the cores in your system.
You can try turning off the JIT to see what difference it makes.
您可以让虚拟机记录有关类加载和编译的信息,请尝试以下虚拟机参数:
-XX:+打印编译
-XX:+TraceClassLoading 加载跟踪类
这可能会提供一些关于幕后发生的事情的进一步线索。
编辑:我不确定这些选项在 java 1.5 中是否有效(我在 1.6 中使用过它们)。 我会尝试检查...
再次编辑:它在 java 1.5 中工作(注意你需要 +,而不是 -,或者你关闭该选项......)
You can get the VM to log information about classloading and compilation, try the following VM args:
-XX:+PrintCompilation
-XX:+TraceClassLoading
This might give some further clues as to what's happening under the hood.
EDIT: I'm not sure those options work in java 1.5 (I've used them in 1.6). I'll try to check...
EDIT again: It works in java 1.5 (note you need +, not -, or you turn the option off...)
我相信您还可以使用 -Xint 的 java 命令的非标准选项来禁用 HotSpot 并仅解释您的代码。 这至少可以将 HotSpot 从解释您的时间的等式中剔除。
I believe you can also use the non-standard option for the java command of -Xint to disable HotSpot and have your code interpreted only. This could at least take HotSpot out of the equation for interpreting your timing.
这是正在工作的热点编译器。 AFAIK,第一次运行函数时,运行“解释”并分析执行路径,然后JIT编译器可以优化后续函数调用。
It's the hotspot compiler at work. AFAIK, the first time it runs the function, runs "interpreted" and the execution path is analyzed, then the JIT compiler can optimize the subsequent function calls.
它肯定是热点编译器。 如果您在 64 位 Solaris 上运行,则默认为服务器 VM,并且热点仅在第一次执行时开始优化。 在客户端虚拟机上,代码可能需要在热点启动之前运行几次。(我相信Solaris只有有服务器虚拟机,但我可能是错的)
It's most certainly the hotspot compiler. If you're running on 64 bit solaris it defaults to the server VM and hotspot just start optimizing on first execution. On the client VM the code may need to run a few times before hotspot kicks in. (i believe solaris only has the server vm but I may be wrong)
请参阅 http://java.sun.com /javase/6/docs/technotes/guides/vm/server-class.html 了解启动器如何在客户端和服务器 VM 之间进行选择,以及不同处理器和操作系统支持的内容。
See http://java.sun.com/javase/6/docs/technotes/guides/vm/server-class.html for how the launcher selects between client and server VM, and what is supported on the different processors and OSes.