Why is my multithreading not efficient?
I've designed a class that fills an array with integers using a varying number of threads, in order to see the power of multithreading. But according to my results, there is none...
The idea: fill an array of 100000000 integers with the value "1", starting with 1 thread (one thread fills the whole array) and incrementing up to 100 threads (each thread fills a sub-array of size 100000000/nbThreads).
Example: With 10 threads, I create 10 threads and each one fills a sub-array of 10000000 integers.
Here is my code:
public class ThreadedArrayFilling extends Thread {
    private int start;          // first index this thread writes to
    private int partitionSize;  // number of elements this thread fills
    public static int[] data;
    public static final int SIZE = 100000000;
    public static final int NB_THREADS_MAX = 100;

    public static void main(String[] args) {
        data = new int[SIZE];
        long startTime, endTime;
        int partition, startIndex, j;
        ThreadedArrayFilling[] threads;
        for (int i = 1; i <= NB_THREADS_MAX; i++) {
            startTime = System.currentTimeMillis();
            // Integer division: when SIZE is not a multiple of i,
            // the last SIZE % i elements are left untouched.
            partition = SIZE / i;
            startIndex = 0;
            threads = new ThreadedArrayFilling[i];
            for (j = 0; j < i; j++) {
                // The constructor starts the thread immediately.
                threads[j] = new ThreadedArrayFilling(startIndex, partition);
                startIndex += partition;
            }
            // Wait for every thread to finish before stopping the clock.
            for (j = 0; j < i; j++) {
                try {
                    threads[j].join();
                } catch (InterruptedException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
            }
            endTime = System.currentTimeMillis();
            System.out.println(i + " THREADS: " + (endTime - startTime) + "ms");
        }
    }

    public ThreadedArrayFilling(int start, int size) {
        this.start = start;
        this.partitionSize = size;
        this.start();
    }

    public void run() {
        for (int i = 0; i < this.partitionSize; i++) {
            data[this.start + i] = 1;
        }
    }

    public static String display(int[] d) {
        String s = "[";
        for (int i = 0; i < d.length; i++) {
            s += d[i] + ", ";
        }
        s += "]";
        return s;
    }
}
And here are my results:
1 THREADS: 196ms
2 THREADS: 208ms
3 THREADS: 222ms
4 THREADS: 213ms
5 THREADS: 198ms
6 THREADS: 198ms
7 THREADS: 198ms
8 THREADS: 198ms
9 THREADS: 198ms
10 THREADS: 206ms
11 THREADS: 201ms
12 THREADS: 197ms
13 THREADS: 198ms
14 THREADS: 204ms
15 THREADS: 199ms
16 THREADS: 203ms
17 THREADS: 234ms
18 THREADS: 225ms
19 THREADS: 235ms
20 THREADS: 235ms
21 THREADS: 234ms
22 THREADS: 221ms
23 THREADS: 211ms
24 THREADS: 203ms
25 THREADS: 206ms
26 THREADS: 200ms
27 THREADS: 202ms
28 THREADS: 204ms
29 THREADS: 202ms
30 THREADS: 200ms
31 THREADS: 206ms
32 THREADS: 200ms
33 THREADS: 205ms
34 THREADS: 203ms
35 THREADS: 200ms
36 THREADS: 206ms
37 THREADS: 200ms
38 THREADS: 204ms
39 THREADS: 205ms
40 THREADS: 201ms
41 THREADS: 206ms
42 THREADS: 200ms
43 THREADS: 204ms
44 THREADS: 204ms
45 THREADS: 206ms
46 THREADS: 203ms
47 THREADS: 204ms
48 THREADS: 204ms
49 THREADS: 201ms
50 THREADS: 205ms
51 THREADS: 204ms
52 THREADS: 207ms
53 THREADS: 202ms
54 THREADS: 207ms
55 THREADS: 207ms
56 THREADS: 203ms
57 THREADS: 203ms
58 THREADS: 201ms
59 THREADS: 206ms
60 THREADS: 206ms
61 THREADS: 204ms
62 THREADS: 201ms
63 THREADS: 206ms
64 THREADS: 202ms
65 THREADS: 206ms
66 THREADS: 205ms
67 THREADS: 207ms
68 THREADS: 210ms
69 THREADS: 207ms
70 THREADS: 203ms
71 THREADS: 207ms
72 THREADS: 205ms
73 THREADS: 203ms
74 THREADS: 211ms
75 THREADS: 202ms
76 THREADS: 207ms
77 THREADS: 204ms
78 THREADS: 212ms
79 THREADS: 203ms
80 THREADS: 210ms
81 THREADS: 206ms
82 THREADS: 205ms
83 THREADS: 203ms
84 THREADS: 203ms
85 THREADS: 209ms
86 THREADS: 204ms
87 THREADS: 206ms
88 THREADS: 208ms
89 THREADS: 263ms
90 THREADS: 216ms
91 THREADS: 230ms
92 THREADS: 216ms
93 THREADS: 230ms
94 THREADS: 234ms
95 THREADS: 234ms
96 THREADS: 217ms
97 THREADS: 229ms
98 THREADS: 228ms
99 THREADS: 215ms
100 THREADS: 232ms
What did I miss?
EDIT: Additional info:
My machine has a dual-core CPU.
Expectations:
- I was expecting to see a huge increase in performance between 1 and 2 threads (to make use of the dual core).
- I was also expecting to see a slowdown after that for a large number of threads.
But the results confirm neither expectation. Are my expectations wrong, or is there a problem with my algorithm?
With two cores, the best you could possibly expect is for 2 threads to take half the time of 1 thread. Any additional threads after that only create useless overhead. That assumes you're completely CPU-bound, but you actually are not.
The question is why you're not seeing an improvement when going from 1 to 2 threads. The reason is probably that your program is not CPU-bound but memory-bound. Your bottleneck is main memory access, and the 2 threads are just taking turns writing to main memory. The actual CPU cores are doing nothing most of the time. You'll see the expected difference if, instead of doing little actual work on a large area of memory, you do a lot of CPU-intensive work on a small amount of memory, because then each CPU core can work completely inside its cache.
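As a rough back-of-the-envelope check of the memory-bound explanation, the timings from the question can be turned into an implied write rate. This is only a sketch: the class and method names are made up for illustration, the 4 bytes per element follows from Java's `int` size, and the measured times also include thread start-up, so the figures are indicative at best.

```java
class BandwidthEstimate {
    // Implied write rate in GB/s (1 GB = 1e9 bytes) for filling `elements`
    // values of `bytesPerElement` bytes each in `millis` milliseconds.
    static double gigabytesPerSecond(long elements, int bytesPerElement, long millis) {
        double bytes = elements * (double) bytesPerElement;
        double seconds = millis / 1000.0;
        return bytes / seconds / 1e9;
    }

    public static void main(String[] args) {
        // Timings taken from the question: 196 ms with 1 thread, 208 ms with 2.
        System.out.println("1 thread:  " + gigabytesPerSecond(100_000_000L, 4, 196) + " GB/s");
        System.out.println("2 threads: " + gigabytesPerSecond(100_000_000L, 4, 208) + " GB/s");
    }
}
```

Both runs come out at about 2 GB/s (roughly 2.04 and 1.92). A write rate that stays flat when a second thread is added is what you would expect if memory, and not the CPU, is the limit.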
Multithreading is super efficient when your software is CPU-bound: there are a lot of single-threaded applications, and you can see them painfully underusing modern CPUs by maxing out only one core (this appears very clearly in CPU monitors).
However, there's no point in launching many more threads than the number of (virtual) CPUs available.
Correctly multithreaded applications that do, for example, number crunching create a number of worker threads that is related to the number of (virtual) CPUs available to the JVM.
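One way this sizing might look in practice is sketched below. `Runtime.getRuntime().availableProcessors()` and `Executors.newFixedThreadPool` are standard JDK calls. The class name, the `sumOfSquares` workload and the chunking scheme are assumptions chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SizedWorkerPool {
    // Sums i*i for i in [0, n), split into one task per worker thread.
    static long sumOfSquares(int n, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = (n + workers - 1) / workers; // ceiling division
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(n, w * chunk);
                final int to = Math.min(n, from + chunk);
                parts.add(pool.submit(() -> {
                    long sum = 0;
                    for (long i = from; i < to; i++) {
                        sum += i * i;
                    }
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that chunk is done
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Size the pool from the number of (virtual) CPUs the JVM sees.
        int workers = Runtime.getRuntime().availableProcessors();
        System.out.println(workers + " workers: " + sumOfSquares(1_000_000, workers));
    }
}
```

The result is the same for any worker count. Only the wall-clock time should change, and it should stop improving once the pool is larger than the number of available CPUs.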
The task you perform inside each thread is so tiny that the time it takes is outweighed by the overhead of your setup.
Do some heavy calculation (e.g. run an approximation of PI to put in the array) and you will see a benefit from multiple threads, but only up to approximately the number of cores your machine has.
Or do something that waits for something external (reading from a database, scraping data from a website). This may perform better with more threads, as long as some threads do something useful while others are waiting.
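The PI suggestion could be tried with something like the sketch below. It assumes a Leibniz-series approximation and a small `double[]`; the class name, the array size and the term count are arbitrary illustration values, not anything from the question.

```java
class PiFilling {
    // Leibniz series: pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
    static double approximatePi(int terms) {
        double sum = 0.0;
        for (int k = 0; k < terms; k++) {
            sum += ((k % 2 == 0) ? 1.0 : -1.0) / (2.0 * k + 1.0);
        }
        return 4.0 * sum;
    }

    // Fills a small array where every element costs a lot of CPU time.
    static double[] fill(int size, int nbThreads, int terms) {
        double[] data = new double[size];
        Thread[] threads = new Thread[nbThreads];
        for (int t = 0; t < nbThreads; t++) {
            // Slice boundaries that cover the whole array, remainder included.
            final int from = (int) ((long) size * t / nbThreads);
            final int to = (int) ((long) size * (t + 1) / nbThreads);
            threads[t] = new Thread(() -> {
                for (int i = from; i < to; i++) {
                    // terms + i keeps each element a separate computation.
                    data[i] = approximatePi(terms + i);
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return data;
    }

    public static void main(String[] args) {
        for (int nbThreads = 1; nbThreads <= 4; nbThreads++) {
            long begin = System.nanoTime();
            fill(1000, nbThreads, 100_000);
            long elapsedMs = (System.nanoTime() - begin) / 1_000_000;
            System.out.println(nbThreads + " THREADS: " + elapsedMs + "ms");
        }
    }
}
```

The array here is tiny and the per-element cost is large, so the time should drop as threads are added, up to roughly the core count. The first iteration also pays for JIT warm-up, so treat individual numbers with caution.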
It is possible for two threads, each with its own CPU or core and working in unison, to complete a task more slowly than if just one thread did all the work. Both cores want their L1+L2 caches to write data to memory, which is fine. However, they soon saturate the common L3 cache, so that it stops accepting additional writes until it has managed to write an updated cache line to RAM, thereby freeing it to accept new writes.
To put it another way, the purpose of your threads is not to perform any processing to speak of but to fill system RAM. System RAM is slow, and as you can see by comparing your one-thread result with that for two threads, the write-to-RAM capacity is all used up with one thread and therefore cannot be faster with two threads.
The code your threads run is so small that in all probability it will reside in the L1 cache and therefore not require fetches from system RAM, which would hamper your capacity to do RAM writes. Your ability to write to RAM is the same whether you have 1 or 100 threads trying to do it. The more threads you have, though, the more thread administration overhead you will have. This is negligible for a few threads but increases with every additional thread and will eventually become noticeable.
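To get a feel for the thread administration overhead on its own, one could time threads that do no work at all. The sketch below does that. The class and method names are hypothetical, and the absolute numbers will vary a lot between machines and JVMs.

```java
class ThreadOverhead {
    // Time in nanoseconds to create, start and join nbThreads empty threads.
    static long startAndJoinNanos(int nbThreads) {
        Thread[] threads = new Thread[nbThreads];
        long begin = System.nanoTime();
        for (int i = 0; i < nbThreads; i++) {
            threads[i] = new Thread(() -> { }); // no work: pure overhead
            threads[i].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return System.nanoTime() - begin;
    }

    public static void main(String[] args) {
        for (int nbThreads : new int[] {1, 10, 100, 1000}) {
            long micros = startAndJoinNanos(nbThreads) / 1_000;
            System.out.println(nbThreads + " THREADS: " + micros + "us");
        }
    }
}
```

Comparing these figures with the roughly 200 ms the fill takes gives an idea of when the overhead starts to matter.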