Why doesn't my computer show a speedup when I use parallel code?

So I realize this question sounds stupid (and yes, I am using a dual core), but I have tried two different libraries (Grand Central Dispatch and OpenMP), and when using clock() to time the code with and without the lines that make it parallel, the speed is the same. (For the record, they were both using their own form of parallel for.) They report being run on different threads, but perhaps they are running on the same core? Is there any way to check? (Both libraries are for C; I'm uncomfortable at lower layers.) This is super weird. Any ideas?

6 Answers

奶茶白久 2024-08-31 12:59:49

EDIT: Added detail for Grand Central Dispatch in response to OP comment.

While the other answers here are useful in general, the specific answer to your question is that you shouldn't be using clock() to compare the timings. clock() measures CPU time, which is summed across threads. When you split a job between cores, it uses at least as much CPU time (usually a bit more due to threading overhead). Search for clock() in its documentation and you will find "If process is multi-threaded, cpu time consumed by all individual threads of process are added."

It's just that the job is split between threads, so the overall time you have to wait is less. You should be using the wall time (the time on a wall clock). OpenMP provides a routine, omp_get_wtime(), for exactly this. Take the following routine as an example:

#include <omp.h>
#include <time.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int i, nthreads;
    clock_t clock_timer;
    double wall_timer;
    for (nthreads = 1; nthreads <= 8; nthreads++) {
        clock_timer = clock();          /* CPU time, summed over all threads */
        wall_timer = omp_get_wtime();   /* wall-clock (elapsed) time */
        #pragma omp parallel for private(i) num_threads(nthreads)
        for (i = 0; i < 100000000; i++) cos(i);
        printf("%d threads: time on clock() = %.3f, on wall = %.3f\n",
            nthreads,
            (double) (clock() - clock_timer) / CLOCKS_PER_SEC,
            omp_get_wtime() - wall_timer);
    }
    return 0;
}

The results are:

1 threads: time on clock() = 0.258, on wall = 0.258
2 threads: time on clock() = 0.256, on wall = 0.129
3 threads: time on clock() = 0.255, on wall = 0.086
4 threads: time on clock() = 0.257, on wall = 0.065
5 threads: time on clock() = 0.255, on wall = 0.051
6 threads: time on clock() = 0.257, on wall = 0.044
7 threads: time on clock() = 0.255, on wall = 0.037
8 threads: time on clock() = 0.256, on wall = 0.033

You can see that the clock() time doesn't change much. I get 0.254 without the pragma, so using OpenMP with one thread is a little slower than not using OpenMP at all, but the wall time decreases with each additional thread.

The improvement won't always be this good, due to, for example, parts of your calculation that aren't parallel (see Amdahl's law) or different threads fighting over the same memory.
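
(A rough rule of thumb I'll add here, not part of the original answer: Amdahl's law says that if a fraction p of the work can run in parallel, the best possible speedup on n cores is

    1 / ((1 - p) + p / n)

so with p = 0.9 and n = 8 the ceiling is about 1 / (0.1 + 0.9/8) ≈ 4.7x, no matter which threading library you use.)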

EDIT: For Grand Central Dispatch, the GCD reference states that GCD uses gettimeofday for wall time. So I created a new Cocoa app, and in applicationDidFinishLaunching I put:

struct timeval t1, t2;
dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
for (int iterations = 1; iterations <= 8; iterations++) {
    int stride = 1e8 / iterations;   /* keep the total work constant as the block count grows */
    gettimeofday(&t1, 0);
    dispatch_apply(iterations, queue, ^(size_t i) {
        for (int j = 0; j < stride; j++) cos(j);
    });
    gettimeofday(&t2, 0);
    NSLog(@"%d iterations: on wall = %.3f\n", iterations,
          t2.tv_sec + t2.tv_usec / 1e6 - (t1.tv_sec + t1.tv_usec / 1e6));
}

and I get the following results on the console:

2010-03-10 17:33:43.022 GCDClock[39741:a0f] 1 iterations: on wall = 0.254
2010-03-10 17:33:43.151 GCDClock[39741:a0f] 2 iterations: on wall = 0.127
2010-03-10 17:33:43.236 GCDClock[39741:a0f] 3 iterations: on wall = 0.085
2010-03-10 17:33:43.301 GCDClock[39741:a0f] 4 iterations: on wall = 0.064
2010-03-10 17:33:43.352 GCDClock[39741:a0f] 5 iterations: on wall = 0.051
2010-03-10 17:33:43.395 GCDClock[39741:a0f] 6 iterations: on wall = 0.043
2010-03-10 17:33:43.433 GCDClock[39741:a0f] 7 iterations: on wall = 0.038
2010-03-10 17:33:43.468 GCDClock[39741:a0f] 8 iterations: on wall = 0.034

which is about the same as I was getting above.

This is a very contrived example. In fact, you need to keep the optimization at -O0, or else the compiler will realize we don't keep any of the calculations and skip the loop entirely. Also, the integer I'm taking the cos of is different in the two examples, but that doesn't affect the results too much. See STRIDE in the dispatch_apply man page for how to do this properly, and for why iterations is broadly comparable to num_threads in this case.
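
If you would rather not depend on -O0, one option (a sketch I'm adding, not part of the original answer) is to accumulate the cosines into an OpenMP reduction variable and print the total, so the loop has an observable result the compiler cannot throw away:

#include <omp.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    double sum = 0.0;                 /* printing this below keeps the loop alive */
    double wall = omp_get_wtime();
    int i;
    #pragma omp parallel for private(i) reduction(+:sum) num_threads(4)
    for (i = 0; i < 100000000; i++) sum += cos(i);
    printf("sum = %f, wall = %.3f\n", sum, omp_get_wtime() - wall);
    return 0;
}

The same idea carries over to the GCD version: have each block accumulate into its own slot of a results array and sum the slots at the end.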

EDIT: I note that Jacob's answer includes

I use the omp_get_thread_num() function within my parallelized loop to print out which core it's working on... This way you can be sure that it's running on both cores.

which is not correct (it has been partly fixed by an edit). Using omp_get_thread_num() is indeed a good way to ensure that your code is multithreaded, but it doesn't show "which core it's working on", just which thread. For example, the following code:

#include <omp.h>
#include <stdio.h>

int main() {
    int i;
    #pragma omp parallel for private(i) num_threads(50)
    for (i = 0; i < 50; i++) printf("%d\n", omp_get_thread_num());
}

prints out that it's using threads 0 to 49, but this doesn't show which core it's working on, since I only have eight cores. By looking at Activity Monitor (the OP mentioned GCD, so they must be on a Mac; go to Window > CPU Usage), you can see the jobs switching between cores, so core != thread.
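
If you really do want the core number rather than the thread number, there is no portable way to get it. On Linux with glibc you can use sched_getcpu() (a sketch under that assumption; macOS has no direct equivalent, which is why Activity Monitor is the practical answer there):

#define _GNU_SOURCE   /* needed for sched_getcpu() */
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        /* The scheduler may migrate threads, so the reported core
           can change between two calls from the same thread. */
        printf("thread %d is currently on core %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}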

久而酒知 2024-08-31 12:59:49

Most likely your execution time isn't bound by those loops you parallelized.

My suggestion is that you profile your code to see what is taking most of the time. Most engineers will tell you that you should do this before doing anything drastic to optimize things.

又怨 2024-08-31 12:59:49

It's hard to guess without any details. Maybe your application isn't even CPU bound. Did you watch CPU load while your code was running? Did it hit 100% on at least one core?

甜是你 2024-08-31 12:59:49

Your question is missing some very crucial details, such as what the nature of your application is, what portion of it you are trying to improve, profiling results (if any), and so on.

Having said that, you should keep several critical points in mind when approaching a performance-improvement effort:

  • Efforts should always concentrate on the code areas which have been proven, by profiling, to be inefficient
  • Parallelizing CPU bound code will almost never improve performance (on a single core machine). You will be losing precious time on unnecessary context switches and gaining nothing. You can very easily worsen performance by doing this.
  • Even if you are parallelizing CPU bound code on a multicore machine, you must remember you never have any guarantee of parallel execution.

Make sure you are not going against these points, because an educated guess (barring any additional details) will say that's exactly what you're doing.

得不到的就毁灭 2024-08-31 12:59:49

If you are using a lot of memory inside the loop, that might prevent it from getting faster. Also, you could look into the pthreads library to handle threading manually.
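
To illustrate the pthreads route (my sketch, not the OP's code): split the index range in half, give each half to its own thread, then join and combine the partial sums.

#include <math.h>
#include <pthread.h>
#include <stdio.h>

#define N 100000000

/* Each thread sums cos() over its own sub-range, so no locking is needed. */
typedef struct { int start, end; double sum; } chunk_t;

static void *work(void *arg) {
    chunk_t *c = arg;
    for (int i = c->start; i < c->end; i++) c->sum += cos(i);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    chunk_t a = { 0, N / 2, 0.0 };
    chunk_t b = { N / 2, N, 0.0 };
    pthread_create(&t1, NULL, work, &a);   /* build with -pthread -lm */
    pthread_create(&t2, NULL, work, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("total = %f\n", a.sum + b.sum);
    return 0;
}

This is essentially what OpenMP's parallel for and GCD's dispatch_apply are doing for you behind the scenes.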

罗罗贝儿 2024-08-31 12:59:49

I use the omp_get_thread_num() function within my parallelized loop to print out which core it's working on, if you don't specify num_threads. For example:

printf("Computing bla %d on core %d/%d ...\n",i+1,omp_get_thread_num()+1,omp_get_max_threads());

The above will work for this pragma
#pragma omp parallel for default(none) shared(a,b,c)

This way you can be sure that it's running on both cores since only 2 threads will be created.

Btw, is OpenMP enabled when you're compiling? In Visual Studio you have to enable it in the Property Pages, under C/C++ -> Language, by setting OpenMP Support to Yes.
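
A quick way to check from inside the code (my addition, not part of the original answer): conforming compilers define the _OPENMP macro only when OpenMP support is actually switched on (the Visual Studio setting above, or -fopenmp for GCC), so a build without it falls through to the second branch and every #pragma omp line is silently ignored, which is exactly the symptom described in the question (correct results, no speedup).

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
#ifdef _OPENMP
    printf("OpenMP enabled (_OPENMP = %d), max threads = %d\n",
           _OPENMP, omp_get_max_threads());
#else
    printf("OpenMP NOT enabled: omp pragmas are being ignored\n");
#endif
    return 0;
}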
