Disjoint grids on processor subsets and their communication in Scalapack
In summary, my question is about how to implement a matrix copy between two block-cyclically distributed matrices on two different process grids in Scalapack (BLACS). I'm attempting to implement this with pdgemr2d_, which I use frequently in other cases where I am copying between two matrices on the same process grid.
Below is a fairly technical discussion of the state of the problem I'm running into. I've got it nailed down to a basic problem, but there doesn't appear to me to be a solution ... though there must be, as Scalapack specifically states that the type of operation I'm trying for is possible. I do not find adequate examples of this anywhere.
An initialization of a 1x1 compute grid in Scalapack running with MPI in C usually goes something like this:
[...]
int NUM_TASKS, RANK;
int MPI_STARTUP = MPI_Init (&argc, &argv);
if (MPI_STARTUP != MPI_SUCCESS)
  MPI_Abort (MPI_COMM_WORLD, MPI_STARTUP);
MPI_Comm_size (MPI_COMM_WORLD, &NUM_TASKS);
MPI_Comm_rank (MPI_COMM_WORLD, &RANK);
int CONTEXT;
int NPROW, NPCOL, MYPROW, MYPCOL;
Cblacs_pinfo (&RANK, &NUM_TASKS);
Cblacs_get (-1, 0, &CONTEXT);
Cblacs_gridinit (&CONTEXT, "Row", 1, 1);
Cblacs_gridinfo (CONTEXT, &NPROW, &NPCOL, &MYPROW, &MYPCOL);
[...]
This code would generate a 1x1 grid regardless of the number of processors MPI knows about ({1, 1}, the size of the grid, is passed to Cblacs_gridinit). Here, CONTEXT indicates to Scalapack functions which grid we're working on (it is possible to use more than one simultaneously, and is generated by Cblacs_get). Cblacs_gridinfo sets NPROW and NPCOL as the number of processor rows and columns ({1, 1} in this case). MYPROW and MYPCOL indicate to each processor which grid block belongs to it. In this case, on a 1x1 grid, only one processor participates and its grid IDs are {0, 0}.
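The same pattern creates any additional grid. For the two-grid situation discussed further down, where I will refer to contexts CONTEXT_A and CONTEXT_B, the setup might be sketched as follows (the 1 x NUM_TASKS shape for grid A is just one way to span every process):
int CONTEXT_A, CONTEXT_B;
/* grid A: a single row containing every MPI process */
Cblacs_get (-1, 0, &CONTEXT_A);
Cblacs_gridinit (&CONTEXT_A, "Row", 1, NUM_TASKS);
/* grid B: a 1x1 grid containing only the first process; on every other
   process Cblacs_gridinit hands back CONTEXT_B = -1 */
Cblacs_get (-1, 0, &CONTEXT_B);
Cblacs_gridinit (&CONTEXT_B, "Row", 1, 1);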
Initialization of a matrix descriptor for a simple block-cyclic distributed 100x100 matrix is typically also simple:
int info;
int desc[9];
int block_size = 32;
int zero = 0; int one = 1;
int num_rows = 100; int num_cols = 100;
int num_cols_local = numroc_ (&num_cols, &block_size, &MYPCOL, &zero, &NPCOL);
int num_cols_local_protect = MAX (1, num_cols_local);
int num_rows_local = numroc_ (&num_rows, &block_size, &MYPROW, &zero, &NPROW);
int num_rows_local_protect = MAX (1, num_rows_local);
descinit_ (desc, &num_rows, &num_cols, &block_size, &block_size, &zero, &zero, \
&CONTEXT, &num_rows_local_protect, &info);
/* now allocate array with per-processor size num_cols_local_protect * num_rows_local_protect */
(We will see later why the "protect" variables are necessary, as on some processors num_cols_local or num_rows_local will be returned, quite correctly, as negative integers.)
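As a quick illustration of what numroc_ computes, here is my own arithmetic (the two, prow0 and prow1 variables are illustrative, not part of the snippet above): on a grid with two process rows, the 100 rows split into blocks of 32 as 32 + 32 + 32 + 4, dealt out round-robin across the process rows.
int two = 2;
int prow0 = 0, prow1 = 1;
/* process row 0 owns blocks 0 and 2: 32 + 32 = 64 local rows */
int rows_p0 = numroc_ (&num_rows, &block_size, &prow0, &zero, &two);
/* process row 1 owns blocks 1 and 3: 32 + 4 = 36 local rows */
int rows_p1 = numroc_ (&num_rows, &block_size, &prow1, &zero, &two);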
Most of the above is self-explanatory except for the two &zero arguments passed to descinit_ (IRSRC and ICSRC), which indicate the processor row on which the first row of the matrix is distributed and the processor column on which the first column is distributed. These values have very explicit bounds when used in the descinit_ function. From the Fortran function itself,
[...]
ELSE IF( IRSRC.LT.0 .OR. IRSRC.GE.NPROW ) THEN
INFO = -6
ELSE IF( ICSRC.LT.0 .OR. ICSRC.GE.NPCOL ) THEN
INFO = -7
[...]
We pass IRSRC and ICSRC as zero here, since {0,0} is the proper index of our single grid block. Even if the grid were much larger, we would likely still pass {0,0}, as the first processor block would likely store the first row and column values.
When run on one processor, this works very well. The values on the only processor, RANK 0, for NPROW, NPCOL, MYPROW and MYPCOL are 1, 1, 0, and 0, respectively. CONTEXT in this case is 0, its non-negativity indicating that the grid it refers to is active on this RANK. These values confirm the existence of a 1x1 process grid, and the first processor correctly identifies the process grid block belonging to RANK 0; in this case, it is the only block.
When run on two processors, however, things break down, though formally they shouldn't. On the first and second RANK, we have for CONTEXT, NPROW, NPCOL, MYPROW and MYPCOL:
RANK 0: 0, 1, 1, 0, 0
RANK 1: -1, -1, -1, -1, -1
On RANK 1, every value is negative. Most importantly, CONTEXT on RANK 1 is negative, indicating that this RANK does not participate in our 1x1 processor grid. Calling descinit_ on all processors now immediately becomes a problem. If we reference the Fortran code from descinit_, we have (repeated from above for clarity):
[...]
ELSE IF( IRSRC.LT.0 .OR. IRSRC.GE.NPROW ) THEN
INFO = -6
ELSE IF( ICSRC.LT.0 .OR. ICSRC.GE.NPCOL ) THEN
INFO = -7
[...]
These restrictions make sense as long as each processor is participating in the grid. The indices cannot be negative or greater than or equal to the total number of rows or columns in the process grid, since such grid blocks don't exist!
On RANK 1, then, IRSRC is passed as zero, but NPROW and NPCOL come back from the grid initialization as -1, so the check IRSRC.GE.NPROW (0 >= -1) is triggered and descinit_ will always fail with INFO = -6.
All of the above can be easily overcome, inelegantly, by simply restricting the initialization of the matrix descriptor and all subsequent operations to processors that participate in the current grid. Something like:
if (CONTEXT > -1) {
[...]
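Spelled out a little more, still only as a sketch reusing the names from the snippets above (local_matrix is my own placeholder name), the guard would wrap the descriptor setup and the local allocation:
if (CONTEXT > -1) {
  descinit_ (desc, &num_rows, &num_cols, &block_size, &block_size, &zero, &zero, \
             &CONTEXT, &num_rows_local_protect, &info);
  /* only grid members allocate and fill their local block-cyclic piece */
  double *local_matrix = malloc ((size_t) num_rows_local_protect *
                                 num_cols_local_protect * sizeof (double));
  /* ... fill local_matrix, call Scalapack routines, free it later ... */
}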
However, I don't have just one processor grid, I have two, and I need them to communicate by means of the pdgemr2d_ function. The purpose of this function is to copy a subset of a distributed matrix A on one grid to a distributed matrix B on another grid. The grids need not be related to each other in any way, and may be partially or completely disjoint. This should be a trivial operation. Say, for example, I want to copy the full matrix from a processor grid with context CONTEXT_A to a processor grid with context CONTEXT_B. The descriptors for the matrix in each context are given as desc_A and desc_B.
pdgemr2d_ (&num_rows, &num_cols, matrix_A, &one, &one, \
desc_A, matrix_B, &one, &one, desc_B, &CONTEXT_B);
This is also fairly self-explanatory. It must run on all processors on which either context has grid members. In my case, CONTEXT_A has a grid spanning all processors MPI knows about, and CONTEXT_B is a 1x1 single-processor grid.
pdgemr2d_ must be supplied with a context identifier encompassing at least all processors included in both CONTEXT_A and CONTEXT_B, and for those processors not belonging to CONTEXT_A or CONTEXT_B, the element desc_A[CTXT] or desc_B[CTXT], respectively, must be set to -1 on that processor.
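Read literally, and assuming CONTEXT_A spans every process as it does in my case, I take that to mean a call shaped like the following sketch (not something I have working yet):
/* desc_B must already carry CTXT = -1 on processes outside grid B (see below);
   the last argument is a context covering every process in either grid, and
   here CONTEXT_A already spans them all */
pdgemr2d_ (&num_rows, &num_cols, matrix_A, &one, &one, desc_A, \
           matrix_B, &one, &one, desc_B, &CONTEXT_A);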
In theory, descinit_ could handle this elegantly, since the CONTEXT values returned by Cblacs_gridinit are -1 on any processor not participating in that context's grid. However, because of the restriction on negative values of NPROW and NPCOL detailed above, descinit_ will not generate the correct matrix descriptor on any processor that does not participate in the grid.
To do proper disjoint grid communication, such a matrix descriptor must be defined on all processors which participate in either context.
Clearly, pdgemr2d_ cannot have been written with this as an insurmountable flaw, as the function description in the code specifically states:
PDGEMR2D copies a submatrix of A on a submatrix of B.
A and B can have different distributions: they can be on different
processor grids, they can have different blocksizes, the beginning
of the area to be copied can be at a different places on A and B.
Thanks very much for any help, I know this is a fairly specialized question.
1 Answer
I had similar difficulties in trying to figure out how to use PDGEMR2D, and thought I'd share my conclusions here.
In short, don't use the supplied DESCINIT subroutine if you're trying to set up multiple contexts. Its error checking assumes that all processes participate in the context for the array descriptor being initialized, which isn't the case if you're trying to use PDGEMR2D.
You can easily initialize your own descriptor without using DESCINIT, as it's just an integer array of length 9. The first 8 fields (dtype, ctxt, m, n, mb, nb, rsrc, and csrc) are global, and should have the same value on all processes. Only the 9th field (lld) is local, and its value is ignored on processes that are not members of the context in which the array is being defined.
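A minimal sketch of what that hand-rolled initialization could look like, using the variable names from the question (the field order below is the standard ScaLAPACK descriptor layout):
int desc_B[9];
desc_B[0] = 1;                    /* dtype: dense block-cyclic matrix */
desc_B[1] = CONTEXT_B;            /* ctxt : already -1 on processes outside grid B */
desc_B[2] = num_rows;             /* m    : global row count */
desc_B[3] = num_cols;             /* n    : global column count */
desc_B[4] = block_size;           /* mb   : row block size */
desc_B[5] = block_size;           /* nb   : column block size */
desc_B[6] = 0;                    /* rsrc : process row owning the first row */
desc_B[7] = 0;                    /* csrc : process column owning the first column */
desc_B[8] = (CONTEXT_B > -1) ? num_rows_local_protect : 1;  /* lld: ignored off-grid */
Because Cblacs_gridinit already leaves CONTEXT_B at -1 on processes that are not members of grid B, the ctxt field ends up with exactly the value PDGEMR2D expects there, with no special-casing needed.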
The example program pdgemrdrv.c in the ScaLAPACK sources (online version here) was useful for me when I was trying to figure this stuff out. It contains a lot of unnecessary complication, but it reduces to the key points above: set up the descriptors by hand, keep the eight global fields identical on every process, leave the ctxt field at -1 on processes that are not members of that descriptor's grid, and call PDGEMR2D on all processes spanned by either grid.
Hope this helps. Cheers.