Negative multithreading speedup in my program
On my laptop with an Intel Pentium dual-core processor T2370 (Acer Extensa) I ran a simple multithreading speedup test. I am using Linux. The code is pasted below. While I was expecting a speedup of 2-3 times, I was surprised to see a slowdown by a factor of 2. I tried the same with gcc optimization levels -O0 ... -O3, but every time I got the same result. I am using pthreads. I also tried the same with only two threads (instead of the 3 threads in the code), but the performance was similar.
What could be the reason? The faster version took reasonably long - about 20 secs - so it does not seem to be an issue of startup overhead.
NOTE: This code is quite buggy (indeed, it does not make much sense, since the output of the serial and parallel versions would differ). The intention was just to "get" a speedup comparison for the same number of instructions.
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>
#include <iostream>

class Thread {
private:
    pthread_t thread;
    static void *thread_func(void *d) { ((Thread *)d)->run(); return NULL; }
public:
    Thread() {}
    virtual ~Thread() {}
    virtual void run() {}
    int start() { return pthread_create(&thread, NULL, Thread::thread_func, (void *)this); }
    int wait() { return pthread_join(thread, NULL); }
};

const int ARR_SIZE = 100000000;
const int N = 20;
int arr[ARR_SIZE];

int main(void)
{
    class Thread_a : public Thread {
    public:
        Thread_a(int *a) : arr_(a) {}
        void run()
        {
            for (int n = 0; n < N; n++)
                for (int i = 0; i < ARR_SIZE/3; i++) { arr_[i] += arr_[i-1]; }
        }
    private:
        int *arr_;
    };

    class Thread_b : public Thread {
    public:
        Thread_b(int *a) : arr_(a) {}
        void run()
        {
            for (int n = 0; n < N; n++)
                for (int i = ARR_SIZE/3; i < 2*ARR_SIZE/3; i++) { arr_[i] += arr_[i-1]; }
        }
    private:
        int *arr_;
    };

    class Thread_c : public Thread {
    public:
        Thread_c(int *a) : arr_(a) {}
        void run()
        {
            for (int n = 0; n < N; n++)
                for (int i = 2*ARR_SIZE/3; i < ARR_SIZE; i++) { arr_[i] += arr_[i-1]; }
        }
    private:
        int *arr_;
    };

    {
        Thread *a = new Thread_a(arr);
        Thread *b = new Thread_b(arr);
        Thread *c = new Thread_c(arr);
        clock_t start = clock();
        if (a->start() != 0) { return 1; }
        if (b->start() != 0) { return 1; }
        if (c->start() != 0) { return 1; }
        if (a->wait() != 0) { return 1; }
        if (b->wait() != 0) { return 1; }
        if (c->wait() != 0) { return 1; }
        clock_t end = clock();
        double duration = (double)(end - start) / CLOCKS_PER_SEC;
        std::cout << "parallel: " << duration << " seconds\n";
        delete a;
        delete b;
        delete c;
    }

    {
        clock_t start = clock();
        for (int n = 0; n < N; n++)
            for (int i = 0; i < ARR_SIZE; i++) { arr[i] += arr[i-1]; }
        clock_t end = clock();
        double duration = (double)(end - start) / CLOCKS_PER_SEC;
        std::cout << "serial: " << duration << " seconds\n";
    }
    return 0;
}
See also: What can make a program run slower when using more threads?
The times you are reporting are measured using the clock function:
The real time will be less for multiprocessor tasks, but the processor time will typically be greater.

When you use multiple threads, the work may be done by more than one processor, but the amount of work is the same, and in addition there may be some overhead such as contention for limited resources.

clock() measures the total processor time, which will be the work + any contention overhead. So it should never be less than the processor time for doing the work in a single thread. It's a little hard to tell from the question whether you knew this and were surprised that the value returned by clock() was twice that for a single thread rather than being only a little more, or whether you were expecting it to be less.

Using clock_gettime() instead (you'll need the realtime library librt: g++ -lrt etc.) gives a result that is still less of a speed-up than one might hope for, but at least the numbers make some sense.

100000000*20/2.5s = 800 MHz; the bus frequency is 1600 MHz, so I suspect that with a read and a write for each iteration (assuming some caching) you're memory-bandwidth limited, as tstenner suggests, and the clock() value shows that most of the time some of your processors are waiting for data. (Does anyone know whether clock() time includes such stalls?)
time includes such stalls?)您的线程所做的唯一事情就是添加一些元素,因此您的应用程序应该是 IO 绑定的。 当你添加一个额外的线程时,你有 2 个 CPU 共享内存总线,所以它不会运行得更快,相反,你会出现缓存未命中等情况。
The only thing your threads do is add some elements, so your application should be memory-bound. When you add an extra thread, you have 2 CPUs sharing the memory bus, so it won't go faster; instead, you'll get cache misses etc.
I believe that your algorithm essentially makes your cache memory useless.

What you are probably seeing is the effect of (non-)locality of reference between the three threads. Essentially, because each thread is operating on a different section of data that is widely separated from the others, you are causing cache misses as the data section for one thread replaces that for another thread in your cache. If your program were structured so that the threads operated on sections of data that were smaller (so that they could all be kept in cache) or closer together (so that all threads could use the same in-cache pages), you'd see a performance boost. As it is, I suspect that your slowdown is because a lot of memory references have to be satisfied from main memory instead of from your cache.
Not related to your threading issues, but there is a bounds error in your code.

You have:

for(int i=0; i<ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}

When i is zero you will be doing arr_[0] += arr_[-1], reading one element before the start of the array.
Also see Herb Sutter's article on how multiple CPUs and cache lines interfere in multithreaded code, especially the section "All Sharing Is Bad -- Even of 'Unshared' Objects...".
As others have pointed out, threads don't necessarily provide improvements to speed. In this particular example, the amount of time spent in each thread is significantly less than the amount of time required to perform context switches and synchronization.
tstenner has got it mostly right.
This is mainly a benchmark of your OS's "allocate and map a new page" algorithm. That array allocation allocates 400MB of virtual memory (100M 4-byte ints); the OS won't actually allocate real physical memory until it's needed. "Allocate and map a new page" is usually protected by a mutex, so more cores won't help.
Your benchmark also stresses the memory bus (minimum 800MB transferred; on OSs that zero memory just before they give it to you, the worst case is 800MB * 7 transfers). Adding more cores isn't really going to help if the bottleneck is the memory bus.
You have 3 threads that are trampling all over the same memory. The cache lines are being read and written by different threads, so they will be ping-ponging between the L1 caches on the two CPU cores. (A cache line that is to be written can only be in one L1 cache, and that must be the L1 cache attached to the CPU core that's doing the write.) This is not very efficient. The CPU cores are probably spending most of their time waiting for cache lines to be transferred, which is why this is slower with threads than if you single-threaded it.
Incidentally, the code is also buggy because the same array is read & written from different CPUs without locking. Proper locking would have an effect on performance.
Threads take you to the promised land of speed boosts(TM) when you have a proper vector implementation. Which means that you need to have:

It is difficult to come up with the first. You need to be able to have redundancy, make sure it's not eating into your performance, properly merge data for processing the next batch of data, and so on...

But this is only a theoretical standpoint.

Running multiple threads doesn't give you much when you have only one processor and a bad algorithm. Remember: there is only one processor, so your threads have to wait for a time slice, and essentially you are doing sequential processing.