Why is this OpenMP program slower than single-threaded?

Posted 2024-11-19 12:14:44


Please look at this code.

Single-threaded program: http://pastebin.com/KAx4RmSJ. Compiled with:

g++ -lrt -O2 main.cpp -o nnlv2

Multithreaded with OpenMP: http://pastebin.com/fbe4gZSn. Compiled with:

g++ -lrt -fopenmp -O2 main_openmp.cpp -o nnlv2_openmp

I tested it on a dual-core system (so we have two threads running in parallel). But the multi-threaded version is slower than the single-threaded one (and its timings are unstable; try running it a few times). What's wrong? Where did I make a mistake?

Some tests:

Single-threaded:

Layers  Neurons  Inputs  Time (ns)
10      200      200       1898983
10      500      500      11009094
10      1000     1000     48116913

Multi-threaded:

Layers  Neurons  Inputs  Time (ns)
10      200      200       2518262
10      500      500      13861504
10      1000     1000     53446849

I don't understand what is wrong.


Comments (4)

怎樣才叫好 2024-11-26 12:14:44

Is your goal here to study OpenMP, or to make your program faster? If the latter, it would be more worthwhile to write multiply-add code, reduce the number of passes, and incorporate SIMD.

Step 1: Combine loops and use multiply-add:

// remove the variable 'temp' completely
for (int i = 0; i < LAYERS; i++)
{
  // k is deliberately NOT reset per neuron: weights[i] is assumed to hold
  // a flat array of NEURONS*INPUTS values, one block of INPUTS per neuron
  int k = 0;
  for (int j = 0; j < NEURONS; j++)
  {
    outputs[j] = 0;

    // multiply-add accumulation over this neuron's inputs
    for (int l = 0; l < INPUTS; l++, k++)
    {
      outputs[j] += inputs[l] * weights[i][k];
    }

    outputs[j] = sigmoid(outputs[j]);
  }

  // this layer's outputs become the next layer's inputs
  std::swap(inputs, outputs);
}

梦里南柯 2024-11-26 12:14:44

Compiling with -static and -p, running, and then parsing gmon.out with gprof, I got:

45.65% gomp_barrier_wait_end

That's a lot of time in OpenMP's barrier routine; that is the time spent waiting for the other threads to finish. Since you're running the parallel for loop many times (once per layer, LAYERS in total), you lose the advantage of running in parallel: every time a parallel for loop finishes, there is an implicit barrier call which won't return until all the other threads have finished.

伴我心暖 2024-11-26 12:14:44

Before anything else, run the test in the multi-threaded configuration and make sure that procexp or Task Manager shows 100% CPU usage for it. If it doesn't, then you aren't using multiple threads or multiple processor cores.

Also, taken from wiki:

Environment variables

A method to alter the execution features of OpenMP applications; used to control loop iteration scheduling, the default number of threads, etc. For example, OMP_NUM_THREADS is used to specify the number of threads for an application.
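For example (the binary name below is taken from the question's compile command, and the setting only has an effect if the program actually contains OpenMP pragmas):

```shell
# Pin the run to two worker threads, matching the dual-core test machine
OMP_NUM_THREADS=2 ./nnlv2_openmp

# Or export it for the whole benchmarking session
export OMP_NUM_THREADS=2
./nnlv2_openmp
```

Varying this between 1 and 2 while watching CPU usage is a quick way to confirm whether the parallel regions are running at all.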

情话墙 2024-11-26 12:14:44

I don't see where you have actually used OpenMP - try #pragma omp parallel for above the main loop... (documented here, for example)

The slowness possibly comes from including OpenMP and its initialisation, the code bloat it adds, or other changes to the compilation resulting from the compiler flags you introduced to enable it. Alternatively, the loops are so small and simple that the overhead of threading far exceeds the performance gain.
