OpenMP task-parallel matrix-vector multiplication and addition
I would like to compute the following matrix-vector multiplication and addition:
y = (A + C + C^T + R + R^T + D1 + D1^T + D2 + D2^T) x
Can I use OpenMP task parallelism to accelerate this operation? The code is as follows:
y.setZero();
int ny = y.size();
#pragma omp parallel shared(x)
{
    #pragma omp single
    {
        VectorXd yA(ny), yC(ny), yCt(ny), yR(ny), yRt(ny), yD1(ny), yD1t(ny), yD2(ny), yD2t(ny);
        yA.setZero();
        yC.setZero();
        yCt.setZero();
        yR.setZero();
        yRt.setZero();
        yD1.setZero();
        yD1t.setZero();
        yD2.setZero();
        yD2t.setZero();
        #pragma omp task
        yA = A * x;
        #pragma omp task
        yC = C * x;
        #pragma omp task
        yCt = C.transpose() * x;
        #pragma omp task
        yR = R * x;
        #pragma omp task
        yRt = R.transpose() * x;
        #pragma omp task
        yD1 = D1 * x;
        #pragma omp task
        yD1t = D1.transpose() * x;
        #pragma omp task
        yD2 = D2 * x;
        #pragma omp task
        yD2t = D2.transpose() * x;
        #pragma omp taskwait
        y = yA + yC + yCt + yR + yRt + yD1 + yD1t + yD2 + yD2t;
    }
}
This code is based on the Eigen library. Compared with not using OpenMP, I cannot observe a noticeable speedup on my computer, but I think this computation can be parallelized with OpenMP, so I am asking this question. I would appreciate it if anybody could answer it.
Yes.
Though in your code, I can see 3 things to improve.
1) Data sharing -
When starting with OpenMP, I recommend using default(none), shared(...) or firstprivate(...) explicitly on your tasks, so you understand how your variables are actually shared or copied.

2) Scaling -
In terms of performance, your code will not scale above 9 cores, since you only express 9 tasks.
When doing algebra, you can taskify applications through block operations. For instance, you can split A = B + C into 4 independent block operations (using the Eigen block interface). The block size here is n/2 and defines your task granularity.

3) Synchronization -
To ensure the correct order of execution, you used #pragma omp taskwait, which is an explicit barrier that waits for the completion of previously spawned tasks. Since OpenMP 4.0, tasks can have dependences, which guarantee the order of execution at a finer level.
Mixing things up -
I made a micro-benchmark with 3 versions of your problem - source.
I split the computation into 2 operations and taskified each operation with m blocks:
add - M := A + C + C^T + R + R^T + D1 + D1^T + D2 + D2^T
mult - y := M.x
This figure shows the dependency task graph for m=2:
- the add(0, 0) task works on the top-left block, the add(1, 0) task works on the top-right block, and so on
- the mult tasks depend on the add tasks working on the same line of blocks
- the norm task is a correctness check.
I ran it on 16 cores, with n=4,096 and m=16 (number of blocks per line-wise operation). Here is my result. I ran it again, but on 64 cores this time and with m=64. The "block version" scales with the number of cores, but the "taskwait" version does not.
Here is the Gantt chart of the 1st execution (m=16). The big blue task corresponds to the sequential-version execution, and the red ones to the taskwait-version execution. The purple and pink tasks correspond respectively to the add and mult tasks of the block version. As you can see, the taskwait version only uses 9 of the 16 cores, while the block version can use every core.
Here is a zoom on the block version, with dependence rendering enabled (arrows).
The figures were generated using Multi-Processor Computing (MPC), which implements an OpenMP runtime and tracing tools.