OpenMP C parallelization of a nested for loop is slow

Posted 2024-11-30 18:12:17

I've been trying to parallelize a nested loop as shown here:

http://pastebin.com/nkictYsw

I'm comparing the execution time of a sequential version and parallelized version of this code, but the sequential version always seems to have shorter execution times with a variety of inputs?

The inputs to the program are:

  • numParticles (loop index)
  • timeStep (not important, value doesn't change)
  • numTimeSteps (loop index)
  • numThreads (number of threads to be used)

I've looked around the web and tried some things out (nowait) and nothing really changed. I'm pretty sure the parallel code is correct because I checked the outputs. Is there something wrong I'm doing here?

EDIT: Also, it seems that you can't use the reduction clause on C structures?

EDIT2: Working on gcc on linux with 2 core cpu. I have tried running this with values as high as numParticles = 40 and numTimeSteps = 100000. Maybe I should try higher?

Thanks

Comments (2)

只是一片海 2024-12-07 18:12:17

It is possible that your loops are too small. There is overhead associated with creating threads to process a portion of the loop, so if the loop is too small, the parallelized version may run slower. Another consideration is the number of cores available.

Your second omp directive is less likely to be useful because there are far fewer calculations in that loop. I would suggest removing it.

EDIT: I tested your code with numParticles = 1000 and two threads. It ran in 30 seconds; the single-threaded version ran in 57 seconds. Even with numParticles = 40 I see a significant speedup. This was with Visual Studio 2010.

动次打次papapa 2024-12-07 18:12:17

I can think of two possible sources for slowdown: a) compiler made some optimizations (vectorization being first) in sequential version but not in OpenMP version, and b) thread management overhead. Both are easy to check if you also run the OpenMP version with a single thread (i.e. set numThreads to 1). If it is much slower than sequential, then (a) is the most likely reason; if it is similar to sequential and faster than the same code with 2 threads, the most likely reason is (b).

In the latter case, you may restructure the OpenMP code for less overhead. First, having two parallel regions (#pragma omp parallel) inside a loop is not necessary; you can have a single parallel region and two parallel loops inside it:

for (t = 0; t <= numTimeSteps; t++) {
    #pragma omp parallel num_threads(numThreads)
    {
        #pragma omp for private(j)
        /* The first loop goes here */
        #pragma omp for
        /* The second loop goes here */
    }
}

Then, the parallel region can be started before the timestep loop:

#pragma omp parallel num_threads(numThreads) private(t)
for (t = 0; t <= numTimeSteps; t++) {
    ...
}

Each thread in the region will then run this loop, and at each iteration the threads will synchronize at the end of each OpenMP loop. This way, you ensure that the same set of threads runs through the whole computation, no matter which OpenMP implementation is used.
