Why is Java faster than C++ in this simple bubble sort benchmark?
I hear from colleagues that C++ is faster than Java and that when looking for top performance, especially for finance applications, that's the route to go. But my observations differ a bit. Can anyone point out flaws in my experiment or add some scientific variables to the discussion?
Note1: I am using -O3 (maximum optimization) and -O2 with the C++ compiler.
Note2: The short and simple complete source codes for each language are included. Feel free to run on your own machine, make changes, draw conclusions and share.
Note3: If you put both source codes side by side in an editor, you will see that their implementations are equivalent.
UPDATE: I've tried clang++ and g++ with a variety of optimization options (-O2, -O3, -Os, -march=native, etc.) and they all produced slower results than Java. I think at this point, to make C++ faster, I would have to dive into the generated assembly code and do some assembly programming. I'm wondering how practical this approach (assembly programming and assembly debugging) is when coding a large real-life application.
What does the benchmark do?
- Create an int array in the heap (not in the stack)
- Start the clock
- Populate the array
- Sort the array with bubble sort
- Stop the clock
Do that 10 million times, discard the first 1 million for warming up and output the average, min and max time.
For C++ I get: (with -O3 and -O2)
$ g++ --version
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
$ g++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -O3
$ ./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 1202 | Min Time: 1158 | Max Time: 212189
$ g++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -O2
$ ./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 1337 | Min Time: 1307 | Max Time: 36650
For Java I get:
$ java -version
java version "17.0.1" 2021-10-19 LTS
Java(TM) SE Runtime Environment (build 17.0.1+12-LTS-39)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.1+12-LTS-39, mixed mode, sharing)
$ javac -cp . TimeBubbleSort.java
$ java -cp . TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 837.0 | Min Time: 812 | Max Time: 37196
Full C++ code:
#include <iostream>
#include <limits>
#include <sstream>
#include <algorithm> // min, max
#include <cstdlib>   // malloc, free
#include <ctime>     // clock_gettime, timespec
using namespace std;
// TO COMPILE: g++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -O3
// TO EXECUTE: ./TimeBubbleSort 10000000 1000000 60
long get_nano_ts(timespec* ts) {
clock_gettime(CLOCK_MONOTONIC, ts);
return ts->tv_sec * 1000000000 + ts->tv_nsec;
}
struct mi {
long value;
};
void swapping(int &a, int &b) {
int temp;
temp = a;
a = b;
b = temp;
}
void bubbleSort(int *array, int size) {
for(int i = 0; i < size; i++) {
bool swaps = false;
for(int j = 0; j < size - i - 1; j++) {
if(array[j] > array[j+1]) {
swapping(array[j], array[j+1]);
swaps = true;
}
}
if (!swaps) break;
}
}
void doSomething(int *array, int size) {
for(int z = 0; z < size; z++) {
array[z] = size - z;
}
bubbleSort(array, size);
}
int main(int argc, char* argv[]) {
int iterations = stoi(argv[1]);
int warmup = stoi(argv[2]);
int arraySize = stoi(argv[3]);
struct timespec ts;
long long x = 0;
long long totalTime = 0;
int minTime = numeric_limits<int>::max();
int maxTime = numeric_limits<int>::min();
int * array = (int*) malloc(arraySize * sizeof(int));
for(int i = 0; i < iterations; i++) {
long start = get_nano_ts(&ts);
doSomething(array, arraySize);
long end = get_nano_ts(&ts);
for(int j = 0; j < arraySize; j++) {
x += array[j];
}
int res = end - start;
if (res <= 0) res = 1;
if (i >= warmup) {
totalTime += res;
minTime = min(minTime, res);
maxTime = max(maxTime, res);
}
}
int count = iterations - warmup;
double avg = totalTime / count;
cout << "Value computed: " << x << endl;
stringstream ss;
ss << "Iterations: " << count << " | Avg Time: " << avg;
if (count > 0) {
ss << " | Min Time: " << minTime << " | Max Time: " << maxTime;
}
cout << ss.str() << endl << endl;
free(array);
return 0;
}
Full Java code:
public class TimeBubbleSort {
// javac -cp . TimeBubbleSort.java
// java -cp . TimeBubbleSort 10000000 1000000 60
private static void swapping(int[] array, int x, int y) {
int temp = array[x];
array[x] = array[y];
array[y] = temp;
}
private static void bubbleSort(int[] array, int size) {
for(int i = 0; i < size; i++) {
int swaps = 0; // flag to detect whether any swap happened
for(int j = 0; j < size - i - 1; j++) {
if (array[j] > array[j + 1]) { // when the current item is bigger than next
swapping(array, j, j + 1);
swaps = 1;
}
}
if (swaps == 0) break; // No swap in this pass, so array is sorted
}
}
private final static void doSomething(int[] array, int size) {
for(int z = 0; z < size; z++) {
array[z] = size - z;
}
bubbleSort(array, size);
}
public static void main(String[] args) {
int iterations = Integer.parseInt(args[0]);
int warmup = Integer.parseInt(args[1]);
int arraySize = Integer.parseInt(args[2]);
long x = 0;
long totalTime = 0;
long minTime = Long.MAX_VALUE;
long maxTime = Long.MIN_VALUE;
int[] array = new int[arraySize];
for(int i = 0; i < iterations; i++) {
long start = System.nanoTime();
doSomething(array, arraySize);
long end = System.nanoTime();
for(int j = 0; j < arraySize; j++) {
x += array[j];
}
int res = (int) (end - start);
if (res <= 0) res = 1;
if (i >= warmup) {
totalTime += res;
minTime = Math.min(minTime, res);
maxTime = Math.max(maxTime, res);
}
}
int count = iterations - warmup;
double avg = totalTime / count;
StringBuilder sb = new StringBuilder();
sb.append("Value computed: ").append(x).append("\n");
sb.append("Iterations: ").append(count).append(" | Avg Time: ").append(avg);
if (count > 0) {
sb.append(" | Min Time: ").append(minTime).append(" | Max Time: ").append(maxTime);
}
System.out.println(sb.toString() + "\n");
}
}
3 Answers
In both cases the array is filled with numbers in descending order, and then bubble sorted, so the two programs should behave no differently. But the way you allocate the array is different.
In C++ you malloc, which just causes the kernel to record that you have requested some memory, but doesn't map any physical memory to the addresses. So once the clock starts and you start initializing the array, every page causes a page fault, and the kernel then maps a physical page at the right address. In Java, though, you allocate and initialize the array to 0. This causes all the physical pages to be mapped, and the memory is now also in the CPU cache.
So when you start the clock the initialization of the array is much faster.
But I guess that is what the warmup should take care of.
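If you wanted to take the page-fault asymmetry out of the picture entirely, the C++ side could touch every page before the benchmark loop starts. A minimal sketch (the prefault helper is my own name, not from the post):
#include <cstring>
// Write to the whole buffer once so the kernel maps physical pages
// before the clock starts, mirroring Java's zero-initialization.
static void prefault(int* array, int size) {
    memset(array, 0, (size_t)size * sizeof(int));
}
// in main(), before the benchmark loop:
int* array = (int*) malloc(arraySize * sizeof(int));
prefault(array, arraySize);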
That said, your test method is flawed. The C++ compiler could optimize the whole loop away, with the exception of the get_nano_ts() calls. The result would then be very close to:
minTime = 1; maxTime = 1; totalTime = iterations - warmup;
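To rule that out, a common trick (my addition here, not from the answer; the same idiom underlies Google Benchmark's DoNotOptimize) is an empty inline-asm statement that the optimizer must assume reads the buffer:
// An empty asm with a "memory" clobber: the compiler must assume the
// buffer is read here, so the sort above it cannot be optimized away.
static void do_not_optimize_away(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}
// usage, inside the timed region:
doSomething(array, arraySize);
do_not_optimize_away(array);   // keep the result "observable"
long end = get_nano_ts(&ts);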
Why do you count a time of 0 as 1? If the sorting doesn't even take a nanosecond you should abort because your test case is by far too small for the accuracy of your clock.
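A more robust fix than clamping to 1 ns (a sketch under my own choice of batch size) is to time a batch of runs and divide, so each sample sits far above the clock's granularity:
const int BATCH = 100;   // tuning assumption, not from the post
long start = get_nano_ts(&ts);
for (int b = 0; b < BATCH; b++) {
    doSomething(array, arraySize);
}
long end = get_nano_ts(&ts);
double nsPerRun = (double)(end - start) / BATCH;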
Let's discuss the results you measured a bit:
You sort the exact same array with exactly the same numbers 9000000 times. So the algorithm should behave the same every time, and on its own every single run should take the exact same time. And yet the time you measure differs by more than 2 orders of magnitude. That is, the sort took 200 times longer in some cases than in others (40 times for Java).
Let's see what happens when I repeat the test multiple times:
Just doing multiple runs shows the max time changing by 50%. At least the min time and avg. time are relatively stable. So it seems that, rarely, the OS will interrupt the process and shuffle it to a different CPU core, causing all the caches to be lost, and then execution time suffers.
Let's play with the compiler flags a bit:
Hey, optimizing less and it's 5 times faster. Note that originally Java was just barely faster than C++. Surely the C++ code must now beat the hell out of Java.
Let's go even further:
Optimizing for size barely makes a difference in speed, if at all. I would say it's below the noise level; it might just be an artefact of different cache alignment of the code or something.
Or let's try clang++:
Reading back over my answer, I totally missed pointing out that with gcc, -O3 code is frequently slower than -O2. For the most part, the reason a lot of optimizer options are in -O3 is that they are generally not faster; otherwise they would be in -O2 (excluding any experimental optimization that isn't considered stable enough yet). Don't use -O3 unless you have tested that it is actually beneficial, and then be very selective about which parts of the code you compile with -O3. Looking at the clang++ output makes me rethink this: different compiler, different optimizer, different behavior for -Os / -O2 / -O3.
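GCC lets you be selective at function granularity, not just per translation unit. A sketch using GCC's optimize attribute and pragma (which functions deserve it is of course an assumption; clang ignores the optimize attribute):
// Compile the whole file with -O2, but ask GCC to apply -O3 to one
// hot function. (GCC-specific.)
__attribute__((optimize("O3")))
void bubbleSort(int *array, int size);

// Alternatively, for a region of the file:
#pragma GCC push_options
#pragma GCC optimize("O3")
// ... hot functions here ...
#pragma GCC pop_options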
Now the real work begins: what code do the compilers generate that makes such a difference? 5 times slower for "gcc -O3" and twice as fast for "clang++ -O3".
For my GCC11, the answer is in Bubble sort slower with -O3 than -O2 with GCC. The -O3 slowdown here is a pretty specific anti-optimization that would often help, or at least not hurt much, but here it hurts a lot in a bubble sort that doesn't keep array[j+1] around in a temporary to be next iteration's array[j]. Instead it reloads it from memory as part of a pair of loads that it does with one wide load, creating a store-forwarding stall.
Your GCC version doesn't have that problem, though; only GCC11 and newer do. So don't expect a big speedup; your GCC7 -O3 should already be making asm with no major problems, except for possible things like How can I mitigate the impact of the Intel jcc erratum on gcc? if you're using a Skylake CPU. (Store and reload of both elements will still create a loop-carried dependency when bubbling an element from one end of the array to the other, though, worse than just updating a register for the next iteration.)
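For reference, a version of the inner loop that keeps the bubbled element in a temporary, the transformation the answer describes GCC11 failing to preserve, might look like this (my sketch, not code from the answer):
// Carry the element being bubbled in 'cur' so next iteration's array[j]
// never has to be reloaded from the store we just did.
void bubbleSortReg(int *array, int size) {
    if (size < 2) return;
    for (int i = 0; i < size; i++) {
        bool swaps = false;
        int cur = array[0];
        for (int j = 0; j < size - i - 1; j++) {
            int next = array[j + 1];
            if (cur > next) {
                array[j] = next;      // smaller element moves down
                array[j + 1] = cur;   // 'cur' keeps bubbling up
                swaps = true;
            } else {
                cur = next;           // 'next' becomes the new 'cur'
            }
        }
        if (!swaps) break;
    }
}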
Whatever clang is doing is better than GCC's best version, though, so you can probably get a big speedup with that.
Given how good the JIT is at spotting and applying dead-code elimination, I strongly suggest rewriting both benchmarks using a proper benchmark harness, e.g. https://github.com/openjdk/jmh for the Java side.
This situation has intrigued me for a while. However, changing from get_nano_ts to std::chrono as:
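(The snippet that followed "as:" did not survive extraction; below is a minimal reconstruction of such a replacement using std::chrono::steady_clock, my sketch rather than the original code.)
#include <chrono>
// Monotonic nanosecond timestamp via std::chrono instead of clock_gettime.
// Callers would drop the timespec argument used by the original get_nano_ts.
long get_nano_ts() {
    using namespace std::chrono;
    return duration_cast<nanoseconds>(
        steady_clock::now().time_since_epoch()).count();
}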
It will give you much better results than the Java version. What blew my mind is that overall program execution in C++ is still slower than in Java on my machine:
C++:
Java: