openmp并行的矩阵向量乘法和加法任务

发布于 2025-01-12 21:14:43 字数 1215 浏览 6 评论 0原文

我想计算以下矩阵向量乘法和加法运算：

y = (A + C + C^T + R + R^T + D1 + D1^T + D2 + D2^T)x

我可以使用 openmp 中的任务并行来加速此操作吗？代码如下

    y.setZero();
    int ny = y.size();
#pragma omp parallel shared(x)
{
#pragma omp single
    {
        VectorXd yA(ny),yC(ny), yCt(ny), yR(ny), yRt(ny), yD1(ny), yD1t(ny), yD2(ny), yD2t(ny);
        yA.setZero();
        yC.setZero();
        yCt.setZero();
        yR.setZero();
        yRt.setZero();
        yD1.setZero();
        yD1t.setZero();
        yD2.setZero();
        yD2t.setZero();

    #pragma omp task
        yA = A * x;
    #pragma omp task
        yC = C * x;
    #pragma omp task
        yCt = C.transpose() * x;
    #pragma omp task
        yR = R * x;
    #pragma omp task
        yRt = R.transpose() * x;
    #pragma omp task
        yD1 = D1 * x;
    #pragma omp task
        yD1t = D1.transpose() * x;
    #pragma omp task
        yD2 = D2 * x;
    #pragma omp task
        yD2t = D2.transpose() * x;

    #pragma omp taskwait
        y = yA + yC + yCt + yR + yRt + yD1 + yD1t + yD2 + yD2t;
    }
};

该代码基于 Eigen 库。与不使用 openmp 相比，我在计算机中观察不到明显的加速。但我认为可以与openmp并行。所以我问这个问题。如果有人能回答我的问题，我将不胜感激。

原文

I would like to compute the following matrix-vector multiplication and adding operation as

y = (A + C + C^T + R + R^T + D1 + D1^T + D2 + D2^T)x

Can I use the tasking parallel in openmp to accelerate this operation? The code is as follow

    y.setZero();
    int ny = y.size();
#pragma omp parallel shared(x)
{
#pragma omp single
    {
        VectorXd yA(ny),yC(ny), yCt(ny), yR(ny), yRt(ny), yD1(ny), yD1t(ny), yD2(ny), yD2t(ny);
        yA.setZero();
        yC.setZero();
        yCt.setZero();
        yR.setZero();
        yRt.setZero();
        yD1.setZero();
        yD1t.setZero();
        yD2.setZero();
        yD2t.setZero();

    #pragma omp task
        yA = A * x;
    #pragma omp task
        yC = C * x;
    #pragma omp task
        yCt = C.transpose() * x;
    #pragma omp task
        yR = R * x;
    #pragma omp task
        yRt = R.transpose() * x;
    #pragma omp task
        yD1 = D1 * x;
    #pragma omp task
        yD1t = D1.transpose() * x;
    #pragma omp task
        yD2 = D2 * x;
    #pragma omp task
        yD2t = D2.transpose() * x;

    #pragma omp taskwait
        y = yA + yC + yCt + yR + yRt + yD1 + yD1t + yD2 + yD2t;
    }
};

This code is based on the Eigen library. Compared with not using the openmp, I can not observe a noticeable acceleration in my computer. But I think it can be paralleled by openmp. So I ask this question. I would appreciate if anybody can answer my question.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

究竟谁懂我的在乎 2025-01-19 21:14:43

我可以使用 openmp 中的任务并行来加速此操作吗？代码如下
块引用

是的。
虽然在你的代码中，我可以看到 3 个需要改进的地方。

1) 数据共享 -
开始使用 OpenMP 时，我建议在您的任务中明确使用 default(none)、shared(...) 或 firstprivate(...) - 这样您就可以了解变量的实际复制方式。

2) 缩放 -
就性能而言，您的代码不会扩展到 9 个核心以上，因为您只执行 9 个任务。
在进行代数运算时，您可以通过块运算对应用程序进行任务化。
例如，您可以将 A = B + C 拆分为 4 个独立的块操作（使用 Eigen 接口...）

A(0, 0, n/2, n/ 2) = B(0, 0, n/2, n/2) + C(0, 0, n/2, n/2)
A(0, n/2, n/2, n/2) = B (0, n/2, n/2, n/2) + C(0, n/2, n/2, n/2)
A(n/2, 0, n/2, n/2) = B(n/2, 0, n/2, n/2) + C(n/2, 0, n/2, n/2)
A(n/2, n/2, n/2, n/2) = B(n/2, n/2, n/2, n/2) + C(n/2, n/2, n/2, n/2)

这里的块大小是 n/2 并定义您的任务粒度

3) 同步性 -
为了确保正确的执行顺序，您使用了#pragma omp taskwait - 这是一个等待先前生成的任务完成的显式屏障。
从 OpenMP 4.0 开始，任务可以具有依赖项，以保证更精细的执行顺序。

混合起来 -
我用你的问题的 3 个版本做了一个微基准 - 源

顺序 - 计算是在 1 个核心
任务等待上完成的- 您的版本
块 - 使用 Eigen 块操作和 OpenMP 依赖项。
我将计算分为 2 个操作，并用 m 个块对每个操作进行任务化。
- 添加 - M := A + C + C^T + R + R^T + D1 + D1^T + D2 + D2^T
- mult - y := Mx
  在此图中，是 m=2 的依赖任务图 - add(0, 0) 任务在左上角的块上工作，add( 1, 0) 任务在右上角的块上工作，等等... - mult 任务取决于在同一行上工作的 add 任务块 - norm 任务是正确性检查。

我在 16 个内核上运行它，使用 n=4,096 和 m=16（每行操作的块数）。这是我的结果

连续版本花费了 1.28707 秒。
taskwait 版本花费了 0.177616 秒。（成功）
任务版本耗时 0.128707 秒。（成功）

我再次运行它，但这次是在 64 个核心上，并且使用 m=64

连续版本花费了 1.31316 秒。
taskwait 版本花费了 0.170817 秒。（成功）
任务版本花费了 0.0785489 秒。（成功）

“块版本”随内核扩展，但“taskwait”版本则不然。

这是第一次执行的甘特图 (m=16)。
大蓝色任务对应于顺序版本执行，红色任务对应于任务等待版本执行。紫色和粉色任务分别对应块版本的 add 和 mult 任务。

如您所见，taskwait 版本仅使用 16 个核心中的 9 个，而 block 版本可以使用每个核心。

块版本的缩放，并使用箭头启用依赖关系渲染

这些数字是使用多处理器计算 (MPC) - 实现 OpenMP 运行时和跟踪工具。

Can I use the tasking parallel in openmp to accelerate this operation? The code is as follow
Blockquote

Yes.
Though in your code, I can see 3 things to improve.

1) Data sharing -
When starting with OpenMP, I recommend using default(none), shared(...) or firstprivate(...) explicitely on your tasks - so you understand how your variable are actually copied.

2) Scaling -
In term of performances, your code will not scales above 9 cores, since you only express 9 tasks.
When doing algebra, you can taskify applications through block operations.
For instance, you can split A = B + C into 4 independant block operations (using Eigen interface...)

A(0, 0, n/2, n/2) = B(0, 0, n/2, n/2) + C(0, 0, n/2, n/2)
A(0, n/2, n/2, n/2) = B(0, n/2, n/2, n/2) + C(0, n/2, n/2, n/2)
A(n/2, 0, n/2, n/2) = B(n/2, 0, n/2, n/2) + C(n/2, 0, n/2, n/2)
A(n/2, n/2, n/2, n/2) = B(n/2, n/2, n/2, n/2) + C(n/2, n/2, n/2, n/2)

The block size here is n/2 and defines your tasks granularity

3) Synchronousity -
To ensure correct order of execution, you used a # pragma omp taskwait - which is an explicit barrier that waits for completion of previously spawned tasks.
Since OpenMP 4.0, tasks can have dependences to guarantee order of execution on a finer level.

Mixing things up -
I made a micro-benchmark with 3 versions of your problem - source

sequential - the computation is done on 1 core
taskwait - your version
block - that uses Eigen block operations, and OpenMP dependences.
I splited the computation into 2 operations and taskified each operations with m blocks.
- add - M := A + C + C^T + R + R^T + D1 + D1^T + D2 + D2^T
- mult - y := M.x
  On this figure, is the dependency task graph for m=2 - the add(0, 0) task works on the top-left block, the add(1, 0) task works on the top-right block, etc... - the mult tasks depends from the add tasks working on the same line of blocks - the norm task is a correctness check.

I run it on 16 cores, with n=4,096 and m=16 (n° of blocks per line-wise operation). Here is my result

sequential version took 1.28707 s.
taskwait version took 0.177616 s.(SUCCESS)
task version took 0.128707 s.(SUCCESS)

I run it again, but on 64 cores this time, and with m=64

sequential version took 1.31316 s.
taskwait version took 0.170817 s.(SUCCESS)
task version took 0.0785489 s.(SUCCESS)

The "block version" scales with cores, but the "taskwait" version does not.

Here is the Gantt chart of the 1st execution (m=16).
The big blue task correspond to the sequential-version execution, the red ones to the taskwait-version execution. The purple and pink tasks correspond respectively to the add and mult tasks of the block-version.

As you can see, the taskwait-version only uses 9 cores out of 16, while the block-version can use every cores.

a zoom on the block-version, and enabling dependences rendering with arrows

The figures were generated using Multi-Processor-Computing (MPC) - which implements an OpenMP runtime and tracing tools.

回复收藏 0 原文

~没有更多了~