Creating a quick/reliable benchmark with Java?
I'm trying to create a benchmark test with java. Currently I have the following simple method:
    public static long runTest(int times){
        long start = System.nanoTime();
        String str = "str";
        for(int i=0; i<times; i++){
            str = "str"+i;
        }
        return System.nanoTime()-start;
    }
I currently call this method several times inside an outer loop and record the min/max/avg time it takes to run. Then I start some activity on another thread and test again. Basically I just want to get consistent results... It seems pretty consistent if I have the runTest loop run 10 million times:
Number of times ran: 5
The max time was: 1231419504 (102.85% of the average)
The min time was: 1177508466 (98.35% of the average)
The average time was: 1197291937
The difference between the max and min is: 4.58%
Activated thread activity.
Number of times ran: 5
The max time was: 3872724739 (100.82% of the average)
The min time was: 3804827995 (99.05% of the average)
The average time was: 3841216849
The difference between the max and min is: 1.78%
Running with thread activity took 320.83% as much time as running without.
But this seems a bit much, and takes some time... if I try a lower number (100000) in the runTest loop... it starts to become very inconsistent:
Number of times ran: 5
The max time was: 34726168 (143.01% of the average)
The min time was: 20889055 (86.02% of the average)
The average time was: 24283026
The difference between the max and min is: 66.24%
Activated thread activity.
Number of times ran: 5
The max time was: 143950627 (148.83% of the average)
The min time was: 64780554 (66.98% of the average)
The average time was: 96719589
The difference between the max and min is: 122.21%
Running with thread activity took 398.3% as much time as running without.
Is there a way that I can do a benchmark like this that is both consistent and efficient/fast?
I'm not testing the code that is between the start and end times, by the way. I'm testing the CPU load in a way (see how I'm starting some thread activity and retesting). So I think that what I'm looking for is something to substitute for the code I have in "runTest" that will yield quicker and more consistent results.
Thanks
Comments (3)
In short:
(Micro-)benchmarking is very complex, so use a tool like the Benchmarking framework http://www.ellipticgroup.com/misc/projectLibrary.zip - and still be skeptical about the results ("Put micro-trust in a micro-benchmark", Dr. Cliff Click).
In detail:
There are a lot of factors that can strongly influence the results:
Brent Boyer's article "Robust Java benchmarking, Part 1: Issues" (http://www.ibm.com/developerworks/java/library/j-benchmark1/index.html) is a good description of all those issues and of whether and what you can do about them (e.g. use JVM options or call ProcessIdleTask beforehand).
You won't be able to eliminate all these factors, so doing statistics is a good idea. But:
The above-mentioned benchmark framework (http://www.ellipticgroup.com/misc/projectLibrary.zip) uses these techniques. You can read about them in Brent Boyer's article "Robust Java benchmarking, Part 2: Statistics and solutions" (https://www.ibm.com/developerworks/java/library/j-benchmark2/).
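As a rough illustration of "doing statistics" on repeated timings, here is a minimal sketch that summarizes a set of samples with the mean, the median and the sample standard deviation. The class name TimingStats and the sample values are made up for illustration; this is not the code of the framework linked above, which uses its own techniques as described in Part 2 of the article.

```java
import java.util.Arrays;

// Hypothetical helper (not part of any framework): summarizes repeated timings.
class TimingStats {

    static double mean(long[] samples) {
        double sum = 0;
        for (long s : samples) sum += s;
        return sum / samples.length;
    }

    // The median is less sensitive than the mean to a single slow outlier run.
    static double median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Sample standard deviation (divides by n - 1); needs at least two samples.
    static double stdDev(long[] samples) {
        double m = mean(samples);
        double sumSq = 0;
        for (long s : samples) sumSq += (s - m) * (s - m);
        return Math.sqrt(sumSq / (samples.length - 1));
    }

    public static void main(String[] args) {
        long[] samples = {10, 12, 11, 50, 12}; // made-up timings with one outlier
        System.out.printf("mean=%.1f median=%.1f stdDev=%.1f%n",
                mean(samples), median(samples), stdDev(samples));
    }
}
```

A large gap between the mean and the median, or a standard deviation that is a big fraction of the mean, is a hint that the runs are too noisy to compare directly.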
Your code ends up testing mainly garbage collection performance, because building a new String on every loop iteration creates and immediately discards a large number of String objects.
This is something that inherently leads to wildly varying measurements and is influenced strongly by multi-thread activity.
I suggest you do something else in your loop that has more predictable performance, like mathematical calculations.
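For example, a minimal sketch of such a loop could look like the following. The class name ArithmeticLoad, the helper work and the sink field are all hypothetical names chosen for illustration; the idea is only that the loop does integer arithmetic without allocating objects, and that the result is stored somewhere so the JIT compiler is less likely to discard the loop as dead code.

```java
// Hypothetical replacement for the string-building loop: pure integer
// arithmetic with no object allocation inside the timed region.
class ArithmeticLoad {

    // Storing the result in a volatile field keeps it observable,
    // so the computation cannot simply be optimized away.
    static volatile long sink;

    // Repeats a simple multiply-and-add step `times` times.
    static long work(int times) {
        long acc = 1;
        for (int i = 0; i < times; i++) {
            acc = acc * 31 + i;
        }
        return acc;
    }

    // Same shape as the original runTest: returns elapsed nanoseconds.
    static long runTest(int times) {
        long start = System.nanoTime();
        sink = work(times);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        System.out.println("elapsed ns: " + runTest(100_000));
    }
}
```

Whether this particular arithmetic is a good stand-in for the CPU load you care about is something you would have to check for your own setup.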
In the 10 million times run, odds are good the HotSpot compiler detected a "heavily used" piece of code and compiled it into machine native code.
Until that happens, JVM bytecode is interpreted, which leaves it susceptible to more interruptions from other background activity in the JVM (like garbage collection).
Generally speaking, these kinds of benchmarks are rife with assumptions that don't hold. You cannot believe that a micro benchmark really proves what it set out to prove without a lot of evidence proving that the initial measurement (time) isn't actually measuring your task and possibly some other background tasks. If you don't attempt to control for background tasks, then the measurement is much less useful.
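One common way to reduce the effect of JIT compilation on the numbers is to run the workload a few times before recording anything, so that the measured runs are more likely to execute compiled code. The sketch below assumes a hypothetical WarmupBenchmark class with an arithmetic workload; the number of warm-up runs is an arbitrary choice, and nothing here guarantees that compilation has actually finished by the time measurement starts.

```java
// Hypothetical harness: discards a few warm-up runs, then records timings.
class WarmupBenchmark {

    // Keeps the workload's result observable so it is not optimized away.
    static volatile long sink;

    // Placeholder workload; substitute the code you actually want to time.
    static long work(int times) {
        long acc = 1;
        for (int i = 0; i < times; i++) {
            acc = acc * 31 + i;
        }
        return acc;
    }

    // Times a single pass over the workload, in nanoseconds.
    static long timeOnce(int times) {
        long start = System.nanoTime();
        sink = work(times);
        return System.nanoTime() - start;
    }

    // Runs `warmupRuns` passes whose timings are thrown away,
    // then returns the timings of `measuredRuns` further passes.
    static long[] measure(int times, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            timeOnce(times); // result intentionally discarded
        }
        long[] samples = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            samples[i] = timeOnce(times);
        }
        return samples;
    }

    public static void main(String[] args) {
        for (long ns : measure(100_000, 20, 5)) {
            System.out.println("elapsed ns: " + ns);
        }
    }
}
```

This only addresses warm-up; background activity such as garbage collection or other threads still has to be controlled or accounted for separately.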