Multithreading not faster than single thread (simple loop test)
I'm experimenting with some multithreading constructions, but somehow multithreading does not seem to be faster than a single thread. I narrowed it down to a very simple test with a nested loop (10000x10000) in which the system only counts.
Below I posted the code for both single threading and multithreading and how they are executed.
The result is that the single thread completes the loop in about 110 ms, while the two threads also take about 112 ms.
I don't think the problem is the overhead of multithreading. If I submit only one of the two Runnables to the ThreadPoolExecutor, it executes in half the time of the single thread, which makes sense. But adding the second Runnable makes it 10 times slower. Both 3.00 GHz cores are running at 100%.
I think it may be PC-specific, as someone else's PC showed double-speed results with multithreading. But then, what can I do about it? I have an Intel Pentium 4 3.00 GHz (2 CPUs) and Java jre6.
Test code:
// Single thread:
long start = System.nanoTime(); // Start timer
final int[] i0 = new int[1]; // This is to keep the test fair (see below)
int i = 0;
for (int x = 0; x < 10000; x++)
{
    for (int y = 0; y < 10000; y++)
    {
        i++; // Just counting...
    }
}
i0[0] = i; // Store the result once, as the threaded version does
long end = System.nanoTime(); // Stop timer
This code is executed in about 110 ms.
// Two threads:
start = System.nanoTime(); // Start timer
// Two of the same kind of variables to count with as in the single thread.
final int[] i1 = new int[1];
final int[] i2 = new int[1];
// First partial task (0-5000)
Thread t1 = new Thread() {
    @Override
    public void run()
    {
        int i = 0;
        for (int x = 0; x < 5000; x++)
            for (int y = 0; y < 10000; y++)
                i++;
        i1[0] = i; // Single store after the loop
    }
};
// Second partial task (5000-10000)
Thread t2 = new Thread() {
    @Override
    public void run()
    {
        int i = 0;
        for (int x = 5000; x < 10000; x++)
            for (int y = 0; y < 10000; y++)
                i++;
        i2[0] = i; // Single store after the loop
    }
};
// Start threads
t1.start();
t2.start();
// Wait for completion
try {
    t1.join();
    t2.join();
} catch (InterruptedException e) {
    e.printStackTrace();
}
end = System.nanoTime(); // Stop timer
This code is executed in about 112 ms.
Edit: I changed the Runnables to Threads and got rid of the ExecutorService (for simplicity of the problem).
Edit: tried some suggestions
Comments (6)
You definitely don't want to keep polling Thread.isAlive() - this burns a lot of CPU cycles for no good reason. Use Thread.join() instead.
Also, it's probably not a good idea to have the threads increment the result arrays directly, cache lines and all. Update local variables, and do a single store when the computations are done.
EDIT:
Totally overlooked that you're using a Pentium 4. As far as I know, there are no multi-core versions of the P4 - to give the illusion of multicore, it has Hyper-Threading: two logical cores share the execution units of one physical core. If your threads depend on the same execution units, your performance will be the same as (or worse than!) single-threaded performance. You'd need, for instance, floating-point calculations in one thread and integer calculations in another to gain performance improvements.
The P4 HT implementation has been criticized a lot; newer implementations (recent Core 2) should be better.
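A minimal sketch of the two suggestions above (wait with Thread.join() rather than polling, and count in a local variable with a single store at the end). The class name JoinCounting and the method countWithTwoThreads are made up for illustration; this is not the asker's exact code.

```java
public class JoinCounting {
    // Splits rows*cols increments across two threads and returns the total.
    // Each thread counts in a local variable and stores its result once.
    static long countWithTwoThreads(final int rows, final int cols) {
        final long[] partial = new long[2];
        Thread[] workers = new Thread[2];
        for (int t = 0; t < 2; t++) {
            final int id = t;
            final int from = id * rows / 2;
            final int to = (id + 1) * rows / 2;
            workers[t] = new Thread() {
                @Override
                public void run() {
                    long local = 0;
                    for (int x = from; x < to; x++)
                        for (int y = 0; y < cols; y++)
                            local++;
                    partial[id] = local; // single store after the loop
                }
            };
        }
        for (Thread w : workers) w.start();
        try {
            // join() blocks without spinning, unlike polling isAlive()
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return partial[0] + partial[1];
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long total = countWithTwoThreads(10000, 10000);
        long ms = (System.nanoTime() - start) / 1000000;
        System.out.println("total=" + total + " in " + ms + " ms");
    }
}
```

Because join() also makes each worker's writes visible to the waiting thread, reading the partial results after it returns is safe without extra synchronisation.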
Try increasing the size of the array somewhat. No, really.
Small objects allocated sequentially in the same thread will tend to be initially allocated sequentially. That's probably in the same cache line. If you have two cores access the same cache line (and then the micro-benchmark is essentially just doing a sequence of writes to the same address), they will have to fight for access.
There's a class in java.util.concurrent that has a bunch of unused long fields. Their purpose is to separate objects that may be frequently used by different threads into different cache lines.
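The padding idea can be sketched roughly as follows. PaddedLong and PaddedCounters are hypothetical names, the 64-byte cache line is an assumption, and the JVM is free to lay out fields differently, so treat this as an illustration of the technique rather than a guaranteed fix.

```java
public class PaddedCounters {
    // Hypothetical padded counter: the unused long fields are meant to push
    // two instances onto different cache lines (assuming 64-byte lines;
    // the JVM does not guarantee this field layout).
    static final class PaddedLong {
        volatile long value;
        long p1, p2, p3, p4, p5, p6, p7; // unused padding
    }

    // Builds a thread that is the only writer of the given counter.
    static Thread writer(final PaddedLong counter, final long increments) {
        return new Thread() {
            @Override
            public void run() {
                for (long n = 0; n < increments; n++)
                    counter.value++; // volatile write on every iteration
            }
        };
    }

    // Two threads each increment their own counter; returns the combined total.
    static long countBoth(long increments) {
        PaddedLong a = new PaddedLong();
        PaddedLong b = new PaddedLong();
        Thread t1 = writer(a, increments);
        Thread t2 = writer(b, increments);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return a.value + b.value;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long total = countBoth(10000000L);
        long ms = (System.nanoTime() - start) / 1000000;
        System.out.println("total=" + total + " in " + ms + " ms");
    }
}
```

Removing the padding fields and timing both variants is one way to see whether the two counters were contending for the same cache line on a given machine.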
I'm not at all surprised at the difference. You are using Java's concurrency framework to create your threads (although I don't see any guarantee that two threads are even created, since the first job might complete before the second even starts).
There's probably all sorts of locking and synchronisation going on behind the scenes which you don't actually need for your simple test. In short I do think the problem is the overhead of multithreading.
You don't do anything with i, so your loop is probably just optimised away.
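One way to guard against that, sketched under the assumption that printing the result is enough to keep it live: return the counter from the timed method, fold it into a value that is printed, and repeat the measurement a few times so that JIT warm-up shows up in the first runs. UseResult and count are illustrative names, and the JIT may still simplify such a trivial loop.

```java
public class UseResult {
    // Counts rows*cols increments and returns the result so the caller can use it.
    static int count(int rows, int cols) {
        int i = 0;
        for (int x = 0; x < rows; x++)
            for (int y = 0; y < cols; y++)
                i++;
        return i;
    }

    public static void main(String[] args) {
        int sink = 0;
        // Several runs: the first ones include interpreter and JIT warm-up.
        for (int run = 0; run < 5; run++) {
            long start = System.nanoTime();
            sink += count(10000, 10000);
            long ms = (System.nanoTime() - start) / 1000000;
            System.out.println("run " + run + ": " + ms + " ms");
        }
        // Printing the accumulated value keeps the counting observable.
        System.out.println("sink=" + sink);
    }
}
```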
Have you checked the number of available cores on your PC with Runtime.getRuntime().availableProcessors()?
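A quick check might look like this (CoreCount and logicalProcessors are illustrative names). Note that the value counts logical processors available to the JVM, so a Hyper-Threading CPU with one physical core can still report 2.

```java
public class CoreCount {
    // Number of logical processors the JVM reports as available.
    static int logicalProcessors() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("available processors: " + logicalProcessors());
    }
}
```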
Your code simply increments a variable - this is a very fast operation anyway. You are not gaining much from the use of multiple threads here. Performance gains are more pronounced when thread 1 has to wait on some external response or do some more complex calculations while your main thread or some other thread continues processing and is not held up waiting. You might see more gains if you counted higher or used more threads (a safe number is probably the number of CPUs/cores in your machine).
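As a rough sketch of that advice, the following splits a heavier (hypothetical) calculation across a fixed pool with one thread per reported processor. PoolPerCore, work and parallelWork are made-up names, and summing square roots merely stands in for "more complex calculations".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolPerCore {
    // Hypothetical CPU-heavy task: sums square roots over [from, to).
    static double work(long from, long to) {
        double acc = 0;
        for (long n = from; n < to; n++)
            acc += Math.sqrt(n);
        return acc;
    }

    // Splits [0, n) into one chunk per thread and adds up the partial sums.
    static double parallelWork(long n, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> parts = new ArrayList<Future<Double>>();
            for (int t = 0; t < threads; t++) {
                final long from = n * t / threads;
                final long to = n * (t + 1) / threads;
                parts.add(pool.submit(new Callable<Double>() {
                    public Double call() {
                        return work(from, to);
                    }
                }));
            }
            double total = 0;
            for (Future<Double> f : parts)
                total += f.get(); // blocks until that chunk is done
            return total;
        } catch (Exception e) {
            throw new RuntimeException("parallel computation failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        long start = System.nanoTime();
        double total = parallelWork(200000000L, threads);
        long ms = (System.nanoTime() - start) / 1000000;
        System.out.println(threads + " threads: " + total + " in " + ms + " ms");
    }
}
```

Calling parallelWork with 1 thread and then with the reported processor count gives a simple before/after comparison on the same machine.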