Why doesn't this code see any significant performance gain when I use multiple threads on a quad-core machine?
I wrote some Java code to learn more about the Executor framework.
Specifically, I wrote code to verify the Collatz Hypothesis - this says that if you iteratively apply the following function to any integer, you get to 1 eventually:
f(n) = ((n % 2) == 0) ? n/2 : 3*n + 1
CH is still unproven, and I figured it would be a good way to learn about Executor. Each thread is assigned a range [l,u] of integers to check.
Specifically, my program takes 3 arguments: N (the number up to which I want to check CH), RANGESIZE (the length of the interval that a thread has to process), and NTHREADS (the size of the thread pool).
My code works fine, but I saw much less speedup than I expected: on the order of 30% when I went from 1 to 4 threads.
My logic was that the computation is completely CPU bound, and each subtask (checking CH for a fixed-size range) takes roughly the same time.
Does anyone have ideas as to why I'm not seeing a 3 to 4x increase in speed?
If you could report your runtimes as you increase the number of threads (along with the machine, JVM and OS), that would also be great.
Specifics
Runtimes:
java -d64 -server -cp . Collatz 10000000 1000000 4 => 4 threads, takes 28412 milliseconds
java -d64 -server -cp . Collatz 10000000 1000000 1 => 1 thread, takes 38286 milliseconds
Processor:
Quadcore Intel Q6600 at 2.4GHZ, 4GB. The machine is unloaded.
Java:
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02, mixed mode)
OS:
Linux quad0 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 x86_64 GNU/Linux
Code: (I couldn't get the code to post at first, I think it's too long for SO requirements; the source is also available on Google Docs)
import java.math.BigInteger;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class MyRunnable implements Runnable {
    public int lower;
    public int upper;

    MyRunnable(int lower, int upper) {
        this.lower = lower;
        this.upper = upper;
    }

    @Override
    public void run() {
        for (int i = lower ; i <= upper; i++ ) {
            Collatz.check(i);
        }
        System.out.println("(" + lower + "," + upper + ")" );
    }
}

public class Collatz {

    public static boolean check( BigInteger X ) {
        if (X.equals( BigInteger.ONE ) ) {
            return true;
        } else if ( X.getLowestSetBit() == 1 ) {
            // odd
            BigInteger Y = (new BigInteger("3")).multiply(X).add(BigInteger.ONE);
            return check(Y);
        } else {
            BigInteger Z = X.shiftRight(1); // fast divide by 2
            return check(Z);
        }
    }

    public static boolean check( int x ) {
        BigInteger X = new BigInteger( new Integer(x).toString() );
        return check(X);
    }

    static int N = 10000000;
    static int RANGESIZE = 1000000;
    static int NTHREADS = 4;

    static void parseArgs( String [] args ) {
        if ( args.length >= 1 ) {
            N = Integer.parseInt(args[0]);
        }
        if ( args.length >= 2 ) {
            RANGESIZE = Integer.parseInt(args[1]);
        }
        if ( args.length >= 3 ) {
            NTHREADS = Integer.parseInt(args[2]);
        }
    }

    public static void maintest(String [] args ) {
        System.out.println("check(1): " + check(1));
        System.out.println("check(3): " + check(3));
        System.out.println("check(8): " + check(8));
        parseArgs(args);
    }

    public static void main(String [] args) {
        long lDateTime = new Date().getTime();
        parseArgs( args );
        List<Thread> threads = new ArrayList<Thread>();
        ExecutorService executor = Executors.newFixedThreadPool( NTHREADS );
        for( int i = 0 ; i < (N/RANGESIZE); i++) {
            Runnable worker = new MyRunnable( i*RANGESIZE+1, (i+1)*RANGESIZE );
            executor.execute( worker );
        }
        executor.shutdown();
        while (!executor.isTerminated() ) {
        }
        System.out.println("Finished all threads");
        long fDateTime = new Date().getTime();
        System.out.println("time in milliseconds for checking to " + N + " is " +
            (fDateTime - lDateTime ) +
            " (" + N/(fDateTime - lDateTime ) + " per ms)" );
    }
}
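One detail in the code as posted may be worth checking independently of the threading question: BigInteger.getLowestSetBit() returns the index of the rightmost one bit, which is 0 for odd numbers, so the test getLowestSetBit() == 1 does not select odd values. A minimal sketch of a parity test using testBit(0) follows; the class name ParityCheck and the helper isOdd are hypothetical and not part of the original program.

```java
import java.math.BigInteger;

class ParityCheck {
    // A number is odd exactly when bit 0 is set.
    static boolean isOdd(BigInteger x) {
        return x.testBit(0);
    }

    public static void main(String[] args) {
        BigInteger three = BigInteger.valueOf(3); // binary 11
        BigInteger six = BigInteger.valueOf(6);   // binary 110
        System.out.println("3: lowestSetBit=" + three.getLowestSetBit()
                + ", isOdd=" + isOdd(three));
        System.out.println("6: lowestSetBit=" + six.getLowestSetBit()
                + ", isOdd=" + isOdd(six));
    }
}
```

Changing the parity test changes how much work each call does, so timings taken before and after such a change would not be directly comparable.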
Comments (4)
Busy waiting can be a problem:
You can use awaitTermination() instead:
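The code snippets from this answer did not survive on this page, so the following is only a sketch of the idea, not the answerer's exact code. It replaces the empty while (!executor.isTerminated()) loop from the question with a blocking call to ExecutorService.awaitTermination. The class name AwaitDemo, the helper runAndWait, and the one-minute timeout are assumptions made for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class AwaitDemo {
    // Runs `tasks` trivial jobs on `threads` workers and blocks until all finish.
    // Returns the number of jobs that actually ran.
    static int runAndWait(int tasks, int threads) {
        final AtomicInteger done = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            executor.execute(new Runnable() {
                @Override
                public void run() {
                    done.incrementAndGet(); // stand-in for the real range check
                }
            });
        }
        executor.shutdown(); // stop accepting new tasks; queued ones still run
        try {
            // Block instead of spinning; retry if the (arbitrary) timeout elapses.
            while (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
                // tasks are still running, keep waiting
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println("completed tasks: " + runAndWait(10, 4));
    }
}
```

The waiting thread sleeps inside awaitTermination, so it does not compete with the worker threads for a core the way the spinning loop can.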
You are using BigInteger. It consumes a lot of register space. What you most likely have on the compiler level is register spilling that makes your process memory-bound.
Also note that when you are timing your results you are not taking into account extra time taken by the JVM to allocate threads and work with the thread pool.
You could also have memory conflicts when you are using constant Strings. All string constants are stored in a shared string pool, so it may become a bottleneck unless Java is really clever about it.
Overall, I wouldn't advise using Java for this kind of stuff. Using pthreads would be a better way to go for you.
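Whether BigInteger is really the bottleneck is this answer's conjecture, but it can be tested without leaving Java. BigInteger is immutable, so every multiply, add and shift in the original check allocates a new object. The sketch below does the same iteration with a primitive long. It assumes every intermediate value fits in a long and throws if that assumption fails; the class name CollatzLong and the method steps are hypothetical.

```java
class CollatzLong {
    // Returns the number of Collatz steps needed to reach 1 from n,
    // using primitive long arithmetic (no object allocation per step).
    static int steps(long n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        int count = 0;
        while (n != 1) {
            if ((n & 1L) == 0) {
                n >>= 1; // even: divide by 2
            } else {
                // Guard the assumption that 3n + 1 still fits in a long.
                if (n > (Long.MAX_VALUE - 1) / 3) {
                    throw new ArithmeticException("3n + 1 would overflow a long");
                }
                n = 3 * n + 1; // odd: 3n + 1
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("steps(3)  = " + steps(3));
        System.out.println("steps(27) = " + steps(27));
    }
}
```

Timing this version at 1 and 4 threads against the BigInteger version would show how much of the missing speedup comes from allocation and garbage collection rather than from the thread pool.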
As @axtavt answered, busy waiting can be a problem. You should fix that first, as it is part of the answer, but not all of it. It won't appear to help in your case (on Q6600), because it seems to be bottlenecked at 2 cores for some reason, so another is available for the busy loop and so there is no apparent slowdown, but on my Core i5 it speeds up the 4-thread version noticeably.
I suspect that in the case of the Q6600 your particular app is limited by the amount of shared cache available or something else specific to the architecture of that CPU. The Q6600 has two 4MB L2 caches, each shared by a pair of cores, and no L3 cache. On my Core i5, each core has a dedicated 256K L2 cache, and there is a larger 8MB shared L3 cache. 256K more per-core cache might make a difference... otherwise something else architecture-wise does.
Here is a comparison of a Q6600 running your Collatz.java, and a Core i5 750.
On my work PC, which is also a Q6600 @ 2.4GHz like yours, but with 6GB RAM, Windows 7 64-bit, and JDK 1.6.0_21* (64-bit), here are some basic results:
Faster, certainly, but not completing in a quarter of the time like you would expect, or even half (though it is roughly just a bit more than half, more on that in a moment). Note that in my case I halved the size of the work units, and have a default max heap of 1500m.
At home on my Core i5 750 (4 cores no hyperthreading), 4GB RAM, Windows 7 64-bit, jdk 1.6.0_22 (64-bit):
The 4-thread version takes 27% of the time the 1-thread version takes when the busy-wait loop is removed. Much better. Clearly the code can make efficient use of 4 cores...
You may want to increase your default heap, just in case garbage collection is happening and slowing your 4 threaded version down a bit. It might help, it might not.
At least in your example, there's a chance your larger work unit size is skewing your results slightly. With 10 work units and 4 threads, the last units run with some threads idle; halving the unit size may help you get closer to at least 2x the speed, since 4 threads will be kept busy for a longer portion of the time. I don't think the Q6600 will do much better at this particular task, whether the cause is cache or some other inherent architecture thing.
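The effect of work unit size can be estimated with simple arithmetic. The sketch below assumes all work units cost the same, which is only roughly true here, and computes the best speedup a fixed pool could reach: the pool needs ceil(tasks / threads) rounds, against tasks rounds for a single thread. The class name WorkUnits and the method idealSpeedup are hypothetical.

```java
class WorkUnits {
    // Upper bound on speedup for `tasks` equal-cost work units on `threads` workers.
    // The pool finishes in ceil(tasks / threads) rounds; one thread needs `tasks` rounds.
    static double idealSpeedup(int tasks, int threads) {
        int rounds = (tasks + threads - 1) / threads; // integer ceiling division
        return (double) tasks / rounds;
    }

    public static void main(String[] args) {
        // N = 10000000 with RANGESIZE = 1000000 gives 10 work units.
        System.out.println("10 units, 4 threads: " + idealSpeedup(10, 4));
        // Halving RANGESIZE to 500000 gives 20 work units.
        System.out.println("20 units, 4 threads: " + idealSpeedup(20, 4));
    }
}
```

With 10 units the bound is 10/3, about 3.33x, because the third round keeps only two threads busy; with 20 units the bound is the full 4x. This explains only a small part of the gap between the measured 30% and the hoped-for 3 to 4x.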
In all cases, I am simply running "java Collatz 10000000 500000 X", where X = # of threads indicated.
The only changes I made to your Java file were to turn one of the println calls into a print, so there were fewer line breaks for my runs with 500000 per work unit and I could see more results in my console at once, and to ditch the busy-wait loop, which matters on the i5 750 but didn't make a difference on the Q6600.
You could try using the submit function and then watching the Futures it returns, checking them to see whether each task has finished.
isTerminated() never returns true unless shutdown() has been called first.
Future<?> submit(Runnable task)
Submits a Runnable task for execution and returns a Future representing that task.
isTerminated()
Returns true if all tasks have completed following shut down.
Try this...
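The code that originally followed this answer is missing from this page, so what follows is a sketch of the approach it describes rather than the answerer's own code. Each task is passed to submit, the returned Future objects are collected, and get() is called on each one to block until that task is done. The class name FutureDemo and the helper runWithFutures are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FutureDemo {
    // Submits `tasks` placeholder jobs and waits on each Future in turn.
    // Returns how many jobs completed normally.
    static int runWithFutures(int tasks, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<Future<?>>();
        for (int i = 0; i < tasks; i++) {
            futures.add(executor.submit(new Runnable() {
                @Override
                public void run() {
                    // placeholder for checking one range of integers
                }
            }));
        }
        executor.shutdown(); // already-submitted tasks still run to completion
        int finished = 0;
        for (Future<?> f : futures) {
            try {
                f.get(); // blocks until this particular task has completed
                finished++;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                break;
            } catch (ExecutionException e) {
                // the task threw an exception; it is not counted as finished
            }
        }
        return finished;
    }

    public static void main(String[] args) {
        System.out.println("finished tasks: " + runWithFutures(10, 4));
    }
}
```

Like awaitTermination, Future.get() blocks the calling thread instead of spinning, and it also reports an exception thrown by an individual task.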