我的程序的多线程负加速

发布于 2024-07-15 10:52:07 字数 3612 浏览 5 评论 0原文

在我的配备 Intel Pentium 双核处理器 T2370 (Acer Extensa) 的笔记本电脑上，我运行了一个简单的多线程加速测试。我正在使用Linux。代码粘贴在下面。虽然我预计速度会提高 2-3 倍，但我很惊讶地发现速度下降为 2 倍。我尝试使用 gcc 优化级别 -O0 ... -O3 进行相同的操作，但每次我得到了同样的结果。我正在使用 pthreads。我也仅使用两个线程（而不是代码中的 3 个线程）进行了相同的尝试，但性能相似。

可能是什么原因？更快的版本花费了相当长的时间——大约 20 秒——所以这似乎不是启动开销的问题。

注意：这段代码有很多错误（实际上它没有多大意义，因为串行和并行版本的输出会不同）。目的只是为了“获得”相同数量指令的加速比较。

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>

class Thread{
    private:
            pthread_t thread;
            static void *thread_func(void *d){((Thread *)d)->run();}
    public:
            Thread(){}
            virtual ~Thread(){}

            virtual void run(){}
            int start(){return pthread_create(&thread, NULL, Thread::thread_func, (void*)this);}
            int wait(){return pthread_join(thread, NULL);}
};


#include <iostream>

const int ARR_SIZE = 100000000;
const int N = 20;
int arr[ARR_SIZE];

int main(void)
{

    class Thread_a:public Thread{
            public:
                    Thread_a(int* a): arr_(a) {}
                    void run()
                    {
                            for(int n = 0; n<N; n++)
                            for(int i=0; i<ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
                    }
            private:
                    int* arr_;
    };
    class Thread_b:public Thread{
            public:
                    Thread_b(int* a): arr_(a) {}
                    void run()
                    {
                            for(int n = 0; n<N; n++)
                            for(int i=ARR_SIZE/3; i<2*ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
                    }
            private:
                    int* arr_;
    };

    class Thread_c:public Thread{
            public:
                    Thread_c(int* a): arr_(a) {}
                    void run()
                    {
                            for(int n = 0; n<N; n++)
                            for(int i=2*ARR_SIZE/3; i<ARR_SIZE; i++){ arr_[i] += arr_[i-1];}
                    }
            private:
                    int* arr_;
    };

    {
            Thread *a=new Thread_a(arr);
            Thread *b=new Thread_b(arr);
            Thread *c=new Thread_c(arr);

            clock_t start = clock();

            if (a->start() != 0) {
                    return 1;
            }

            if (b->start() != 0) {
                    return 1;
            }
            if (c->start() != 0) {
                    return 1;
            }

            if (a->wait() != 0) {
                    return 1;
            }

            if (b->wait() != 0) {
                    return 1;
            }

            if (c->wait() != 0) {
                    return 1;
            }

            clock_t end = clock();
            double duration = (double)(end - start) / CLOCKS_PER_SEC;

            std::cout << duration << "seconds\n";
            delete a;
            delete b;

    }
    {
            clock_t start = clock();
            for(int n = 0; n<N; n++)
            for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}
            clock_t end = clock();
            double duration = (double)(end - start) / CLOCKS_PER_SEC;

            std::cout << "serial: " << duration << "seconds\n";
    }

    return 0;
  }

另请参阅：什么会使程序运行速度变慢当使用更多线程时？

原文

On my laptop with Intel Pentium dual-core processor T2370 (Acer Extensa) I ran a simple multithreading speedup test. I am using Linux. The code is pasted below. While I was expecting a speedup of 2-3 times, I was surprised to see a slowdown by a factor of 2. I tried the same with gcc optimization levels -O0 ... -O3, but everytime I got the same result. I am using pthreads. I also tried the same with only two threads (instead of 3 threads in the code), but the performance was similar.

What could be the reason? The faster version took reasonably long - about 20 secs - so it seems is not an issue of startup overhead.

NOTE: This code is a lot buggy (indeed it does not make much sense as the output of serial and parallel versions would be different). The intention was just to "get" a speedup comparison for the same number of instructions.

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>

class Thread{
    private:
            pthread_t thread;
            static void *thread_func(void *d){((Thread *)d)->run();}
    public:
            Thread(){}
            virtual ~Thread(){}

            virtual void run(){}
            int start(){return pthread_create(&thread, NULL, Thread::thread_func, (void*)this);}
            int wait(){return pthread_join(thread, NULL);}
};


#include <iostream>

const int ARR_SIZE = 100000000;
const int N = 20;
int arr[ARR_SIZE];

int main(void)
{

    class Thread_a:public Thread{
            public:
                    Thread_a(int* a): arr_(a) {}
                    void run()
                    {
                            for(int n = 0; n<N; n++)
                            for(int i=0; i<ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
                    }
            private:
                    int* arr_;
    };
    class Thread_b:public Thread{
            public:
                    Thread_b(int* a): arr_(a) {}
                    void run()
                    {
                            for(int n = 0; n<N; n++)
                            for(int i=ARR_SIZE/3; i<2*ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
                    }
            private:
                    int* arr_;
    };

    class Thread_c:public Thread{
            public:
                    Thread_c(int* a): arr_(a) {}
                    void run()
                    {
                            for(int n = 0; n<N; n++)
                            for(int i=2*ARR_SIZE/3; i<ARR_SIZE; i++){ arr_[i] += arr_[i-1];}
                    }
            private:
                    int* arr_;
    };

    {
            Thread *a=new Thread_a(arr);
            Thread *b=new Thread_b(arr);
            Thread *c=new Thread_c(arr);

            clock_t start = clock();

            if (a->start() != 0) {
                    return 1;
            }

            if (b->start() != 0) {
                    return 1;
            }
            if (c->start() != 0) {
                    return 1;
            }

            if (a->wait() != 0) {
                    return 1;
            }

            if (b->wait() != 0) {
                    return 1;
            }

            if (c->wait() != 0) {
                    return 1;
            }

            clock_t end = clock();
            double duration = (double)(end - start) / CLOCKS_PER_SEC;

            std::cout << duration << "seconds\n";
            delete a;
            delete b;

    }
    {
            clock_t start = clock();
            for(int n = 0; n<N; n++)
            for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}
            clock_t end = clock();
            double duration = (double)(end - start) / CLOCKS_PER_SEC;

            std::cout << "serial: " << duration << "seconds\n";
    }

    return 0;
  }

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

Bonjour°[大白 2024-07-22 10:52:07

您报告的时间是使用时钟功能测量的：

clock() 函数返回近似值程序使用的处理器时间。

$ time bin/amit_kumar_threads.cpp
6.62seconds
serial: 2.7seconds

real    0m5.247s
user    0m9.025s
sys 0m0.304s

多处理器任务的实时时间会更少，但处理器时间通常会更长。

当您使用多线程时，工作可能由多个处理器完成，但工作量是相同的，此外可能还存在一些开销，例如争用有限资源。 clock() 测量总处理器时间，即工作量 + 任何争用开销。因此，它永远不应该少于在单个线程中完成工作的处理器时间。

从这个问题中很难判断您是否知道这一点，并且惊讶于 clock() 返回的值是单个线程的两倍，而不是仅仅多一点，或者您是期望它会更少。

使用 clock_gettime() 代替（您需要实时库 librt、g++ -lrt 等）给出：

$ time bin/amit_kumar_threads.cpp
2.524 seconds
serial: 2.761 seconds

real    0m5.326s
user    0m9.057s
sys 0m0.344s

这仍然比人们希望的要少，但至少数字是有意义的。

100000000*20/2.5s = 800Hz，总线频率为 1600 MHz，因此我怀疑每次迭代都会进行读取和写入（假设有一些缓存），正如 tstenner 所建议的那样，内存带宽受到限制，并且时钟() 值表明大部分时间您的某些处理器都在等待数据。（有谁知道clock()时间是否包括这样的停顿？）

The times you are reporting are measured using the clock function:

The clock() function returns an approximation of processor time used by the program.

$ time bin/amit_kumar_threads.cpp
6.62seconds
serial: 2.7seconds

real    0m5.247s
user    0m9.025s
sys 0m0.304s

The real time will be less for multiprocessor tasks, but the processor time will typically be greater.

When you use multiple threads, the work may be done by more than one processor, but the amount of work is the same, and in addition there may be some overhead such as contention for limited resources. clock() measures the total processor time, which will be the work + any contention overhead. So it should never be less than the processor time for doing the work in a single thread.

It's a little hard to tell from the question whether you knew this, and were surprised that the value returned by clock() was twice that for a single thread rather than being only a little more, or you were expecting it to be less.

Using clock_gettime() instead (you'll need the realtime library librt, g++ -lrt etc.) gives:

$ time bin/amit_kumar_threads.cpp
2.524 seconds
serial: 2.761 seconds

real    0m5.326s
user    0m9.057s
sys 0m0.344s

which still is less of a speed-up than one might hope for, but at least the numbers make some sense.

100000000*20/2.5s = 800Hz, the bus frequency is 1600 MHz, so I suspect with a read and a write for each iteration (assuming some caching), you're memory bandwidth limited as tstenner suggests, and the clock() value shows that most of the time some of your processors are waiting for data. (does anyone know whether clock() time includes such stalls?)

回复收藏 0 原文

枯寂 2024-07-22 10:52:07

您的线程所做的唯一事情就是添加一些元素，因此您的应用程序应该是 IO 绑定的。当你添加一个额外的线程时，你有 2 个 CPU 共享内存总线，所以它不会运行得更快，相反，你会出现缓存未命中等情况。

回复收藏 0 原文

格子衫的從容 2024-07-22 10:52:07

我相信你的算法本质上使你的缓存毫无用处。

您可能看到的是三个线程之间引用的（非）局部性的影响。本质上，因为每个线程都在与其他线程相距很远的不同数据部分上进行操作，所以当一个线程的数据部分替换了缓存中另一个线程的数据部分时，您会导致缓存未命中。如果您的程序被构造为使得线程对较小的数据部分（以便它们都可以保存在内存中）或更靠近的数据部分进行操作（以便所有线程可以使用相同的缓存内页面），您会看到性能提升。事实上，我怀疑您的速度变慢是因为必须从主内存而不是从缓存满足大量内存引用。

回复收藏 0 原文

对你的占有欲 2024-07-22 10:52:07

与您的线程问题无关，但您的代码中存在边界错误。
你有：

for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}

当i为零时你会做

arr[0] += arr[-1];

Not related to your threading issues, but there is a bounds error in your code.
You have:

for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}

When i is zero you will be doing

arr[0] += arr[-1];

回复收藏 0 原文

茶花眉 2024-07-22 10:52:07

另请参阅 herb 的文章，了解如何操作多CPU和缓存行干扰多线程代码，特别是“所有共享都是不好的——即使是“非共享”对象......”部分

回复收藏 0 原文

唐婉 2024-07-22 10:52:07

正如其他人指出的那样，线程不一定能提高速度。在此特定示例中，每个线程花费的时间量明显少于执行上下文切换和同步所需的时间量。

回复收藏 0 原文

只为守护你 2024-07-22 10:52:07

tstenner 说得基本正确。

这主要是操作系统“分配和映射新页面”算法的基准。该数组分配分配了 800MB 的虚拟内存；操作系统在需要之前不会真正分配真正的物理内存。 “分配和映射新页面”通常受到互斥体的保护，因此更多的内核没有帮助。

您的基准测试还强调内存总线（最小传输 800MB；在操作系统上，在将内存分配给您之前将其清零，最坏的情况是 800MB * 7 传输）。如果瓶颈是内存总线，那么添加更多核心并没有真正的帮助。

您有 3 个线程正在遍历同一内存。高速缓存行由不同的线程读取和写入，因此将在两个 CPU 内核上的 L1 高速缓存之间进行乒乓操作。（要写入的缓存行只能位于一个 L1 缓存中，并且该缓存必须是附加到执行写入操作的 CPU 代码的 L1 缓存）。这不是很有效。 CPU 核心可能花费大部分时间等待缓存行传输，这就是为什么线程处理比单线程处理慢的原因。

顺便说一句，代码也有错误，因为读取和读取相同的数组。从不同的 CPU 写入，无需锁定。正确的锁定会对性能产生影响。

回复收藏 0 原文