Efficiency of using omp parallel for inside an omp task

Posted 2025-01-16 14:10:45


I am trying to improve the efficiency of a complicated project using OpenMP. I will use a toy example to show the situation I am facing.

First, I have a recursive function, and I use omp task to parallelize it, like this:

void CollectMsg(Node *n) {
#pragma omp task
{
    if (n->child1) {   // stop the recursion at leaf nodes
        CollectMsg(n->child1);
        ProcessMsg1(n->array, n->child1->array);
    }
}

#pragma omp task
{
    if (n->child2) {
        CollectMsg(n->child2);
        ProcessMsg1(n->array, n->child2->array);
    }
}

#pragma omp taskwait
    ProcessMsg2(n->array);
}

int main(void) {
    // root is assumed to be built elsewhere in this toy example
#pragma omp parallel num_threads(2)
    {
#pragma omp single
        {
            CollectMsg(root);
        }
    }
    return 0;
}

Then, however, the function ProcessMsg1 contains some computations, and I want to use omp parallel for on it. The following code shows a simple example. Written this way, the project runs correctly, but its efficiency is only slightly improved, so I am wondering about the working mechanism of this process. Does omp parallel for take effect when used inside an omp task? I ask because there is only a slight difference in efficiency between using and not using omp parallel for in ProcessMsg1. If the answer is yes: I first created 2 threads for omp task, and during the processing of each task I create another 2 threads for omp parallel for, so should I think of this as 2 * 2 = 4-thread parallelism? If the answer is no, how can I further improve the efficiency of this case?

// size is a global in this toy example
void ProcessMsg1(int *a1, int *a2) {
#pragma omp parallel for num_threads(2)
    for (int i = 0; i < size; ++i) {
        a1[i] *= a2[i];
    }
}
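
One way to check what the runtime actually does here is to print the nesting level and the team size from inside the inner region. This is a minimal diagnostic sketch (the name ProcessMsg1Debug and its explicit size parameter are illustrative, not part of the project); omp_get_level() and omp_get_num_threads() are standard OpenMP runtime calls. If nested parallelism is disabled, as most runtimes do by default, the inner region will report a team of 1 thread:

#include <omp.h>
#include <stdio.h>

void ProcessMsg1Debug(int *a1, int *a2, int size) {
#pragma omp parallel for num_threads(2)
    for (int i = 0; i < size; ++i) {
        if (i == 0) {
            // level 2 means this region is nested inside another
            // parallel region; num_threads is this team's size
            printf("level = %d, team size = %d\n",
                   omp_get_level(), omp_get_num_threads());
        }
        a1[i] *= a2[i];
    }
}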

Comments (2)

赴月观长安 2025-01-23 14:10:45


Nesting a parallel for inside a task will work, but it requires you to enable nested execution, and it leads to all sorts of issues that are difficult to resolve, especially around load balancing and the actual number of threads being created.
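
For reference, a minimal sketch of what enabling nested execution would look like (omp_set_max_active_levels() is the standard, non-deprecated replacement for omp_set_nested()); with 2 outer threads each opening an inner team of 2, you do indeed get up to 2 * 2 = 4 threads:

#include <omp.h>

int main(void) {
    // allow two levels of active parallel regions; without this,
    // inner parallel regions typically run with a team of 1 thread
    omp_set_max_active_levels(2);

#pragma omp parallel num_threads(2)      // outer team: 2 threads
    {
#pragma omp parallel num_threads(2)      // inner team per outer thread: 2
        {
            // up to 2 * 2 = 4 threads execute here concurrently
        }
    }
    return 0;
}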

I would recommend using the taskloop construct instead:

void ProcessMsg1(int *a1, int *a2) {
#pragma omp taskloop grainsize(100)
    for (int i = 0; i < size; ++i) {
        a1[i] *= a2[i];
    }
}

This will create additional tasks to execute the loop. If ProcessMsg1 is executed from different tasks in the system, the tasks created by taskloop will mix automatically with all the other tasks that your code is creating.

Depending on how much work your code does per loop iteration, you may want to adjust the size of the tasks using the grainsize clause (I just put 100 as an example). See the OpenMP spec for further details.
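
Alternatively (a variant not covered in this answer), the num_tasks clause fixes how many tasks the loop is split into instead of how many iterations each task gets; grainsize and num_tasks are mutually exclusive on a taskloop. A sketch, with 8 as arbitrary a value as the 100 above:

void ProcessMsg1(int *a1, int *a2) {
    // ask the runtime for (at most) 8 loop tasks, letting it
    // derive the per-task iteration count from size (the same
    // global as in the code above)
#pragma omp taskloop num_tasks(8)
    for (int i = 0; i < size; ++i) {
        a1[i] *= a2[i];
    }
}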

If your code does not have to wait until the whole loop in ProcessMsg1 has been executed, then you can add the nogroup clause, which basically means that once all the loop tasks have been created, they can still execute on other threads while the ProcessMsg1 function is already exiting. The default behavior of taskloop, though, is to wait until the loop tasks (and any potential child tasks they create) have completed.
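
A minimal sketch of that variant (size again assumed global, as in the question):

void ProcessMsg1(int *a1, int *a2) {
    // nogroup removes the implicit taskgroup around the loop
    // tasks: ProcessMsg1 may return while they are still running
#pragma omp taskloop grainsize(100) nogroup
    for (int i = 0; i < size; ++i) {
        a1[i] *= a2[i];
    }
}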

阳光下的泡沫是彩色的 2025-01-23 14:10:45


Q :
" ... how can I further improve the efficiency of this case? "

A :
follow the "economy-of-costs"
some visible
some less, yet nevertheless also deciding the resulting ( im )performance & ( in )efficiency

How much did we already have to pay,
before any PAR-section even started?

Let's start with the costs of adding multiple streams of code-execution, as proposed:

Economy of Costs

Using the number of ASM instructions as a simplified measure of how much work has to be done (where all the CPU-clocks, RAM-allocation costs, RAM-I/O and O/S system-management time spent count), we start to see the relative cost of all these (unavoidable) add-on overheads, compared to the actually useful task (i.e., how many ASM instructions are finally spent on what we want to compute, contrasted with the amount of overhead already burnt to make that happen).

This fraction is cardinal if fighting for both performance and efficiency (of resource-usage patterns).

Cases where the add-on overhead costs dominate are plain anti-patterns.

Cases where the add-on overhead costs amount to less than 0.01 % of the useful work may still end up with unsatisfyingly low speed-ups (see the simulator and related details).

Cases where the scope of useful work dwarfs all the add-on overhead costs still face the Amdahl's Law ceiling, aptly a "law of diminishing returns", since adding more resources eventually ceases to improve performance, even if we add infinitely many CPU cores and the like.
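
In symbols, a sketch of the overhead-aware form of Amdahl's Law (p = parallelisable fraction, N = number of workers, o = add-on setup/coordination overhead as a fraction of the total work; folding all overheads into a single o-term is this sketch's simplification):

S(N) = 1 / ( (1 - p) + p/N + o )
S(N -> inf) = 1 / ( (1 - p) + o )

So even with p = 1 (a 100 % parallelisable code), an overhead of just o = 0.01 caps the achievable speed-up at 100x.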

GUI-live simulator

Tips for experimenting with the costs of instantiation:

We start by assuming the code is 100 % parallelisable (having no pure-[SERIAL] part, which is never true in reality, yet let's use it).

- let's move the NCPUcores-slider in the simulator to anything above 64 cores

- next, move the Overhead-slider in the simulator to anything above plain zero (it expresses the relative add-on cost of spawning one of the NCPUcores processes, as a percentage of the number of instructions in the [PARALLEL]-section part). Mathematically "dense" work has many such "useful" instructions and, supposing no other performance killers jump out of the box, may spend some reasonable amount of "add-on" costs to spawn some number of concurrent or parallel operations (the actual number depends only on the actual economy of costs, not on how many CPU-cores are present, and even less on our "wishes" or on scholastic or, worse, copy/paste-"advice"). On the contrary, mathematically "shallow" work almost always yields "speed-ups" << 1 (immense slow-downs), as there is almost no chance to justify the known add-on costs (paid on thread/process-instantiation and on data SER/xfer/DES when moving params in and results back, worse still if between processes)

- next, move the Overhead-slider in the simulator to the rightmost edge == 1. This shows the case when the actual thread/process-spawning overhead (a time lost) still costs no more than <= 1 % of all the computing-related instructions that follow, i.e., those performed for the "useful" part of the work computed inside each such spawned process-instance. So even such a 1:100 proportion factor (doing 100x more "useful" work than the CPU-time lost on arranging that many copies and making the O/S-scheduler orchestrate their concurrent execution inside the available system Virtual-Memory) already shows all the warnings visible in the graph of the Speed-up degradation. Just play a bit with the Overhead-slider in the simulator before touching the others...

- only now touch the p-slider in the simulator and move it to anything less than 100 % (introducing a [SERIAL] part of the problem execution; having none was nice in theory so far, yet is never doable in practice, as even the program launch is pure-[SERIAL] by design)
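
These slider experiments can also be reproduced numerically; a minimal sketch in C under the overhead-aware formula above (the p and o values mirror the slider settings; the saturation near 1/o is the point to watch):

#include <stdio.h>

/* overhead-aware Amdahl speed-up: p = parallel fraction,
   o = add-on overhead as a fraction of the total work */
static double speedup(double p, int n, double o) {
    return 1.0 / ((1.0 - p) + p / (double)n + o);
}

int main(void) {
    const double p = 1.0;   /* 100 % parallelisable, as assumed above */
    const double o = 0.01;  /* Overhead-slider at 1 % */
    for (int n = 1; n <= 1024; n *= 4)
        printf("N = %4d  ->  speed-up = %6.2f\n", n, speedup(p, n, o));
    /* with p = 1 and o = 0.01 the speed-up saturates near 100x,
       no matter how many more cores are added */
    return 0;
}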

So,
besides straight errors,
besides performance anti-patterns,
there is a lot of technical reasoning for ILP, SIMD-vectorisation and cache-line-respecting tricks, which is what first starts to squeeze out the maximum possible performance the task can ever get:

  • refactoring of the real problem shall never go against the collected knowledge about performance, as repeating things that do not work brings no advantage, does it?
  • respect your physical-platform constraints; ignoring them will degrade your performance
  • benchmark, profile, refactor
  • benchmark, profile, refactor
  • benchmark, profile, refactor
    no other magic wand available here.

Details matter, always. The CPU / NUMA architecture details matter, a lot. Check every possibility of the actual native architecture, because without all these details the runtime performance will not reach the capabilities that are technically available.
