Parallelizing C++ code with MPI_Send and MPI_Recv


I have parallel code, but I do not understand whether it actually works in parallel.
I have two vectors A and B whose elements are matrices defined by a dedicated class.
Since the matrices in the vectors are not of a primitive type, I cannot send these vectors to the other ranks with MPI_Scatter, so I have to use MPI_Send and MPI_Recv. Also, rank 0 has only a coordinating role: it sends the other ranks the blocks they should work with and collects the results at the end, but it does not participate in the computation.
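The actual matrix class is not shown here; the following is only a minimal sketch of the interface the code below assumes (square blocks of double stored contiguously in row-major order, with rows(), data(), operator() and operator*), so that data() can be passed straight to MPI_Send / MPI_Recv. It is a hypothetical stand-in, not the real implementation:

// Hypothetical dense_matrix sketch: NOT the real class, only the interface
// the code below relies on (contiguous row-major storage of doubles).
#include <vector>

class dense_matrix
{
public:
    dense_matrix() : n_rows(0), n_cols(0) {}   // needed by std::vector<dense_matrix>(N)
    dense_matrix(unsigned r, unsigned c) : n_rows(r), n_cols(c), values(r * c, 0.0) {}

    unsigned rows() const { return n_rows; }
    double*       data()       { return values.data(); }   // contiguous buffer for MPI
    const double* data() const { return values.data(); }

    double& operator()(unsigned i, unsigned j)       { return values[i * n_cols + j]; }
    double  operator()(unsigned i, unsigned j) const { return values[i * n_cols + j]; }

private:
    unsigned n_rows, n_cols;
    std::vector<double> values;   // row-major storage
};

// naive block product, as used in local_C[j] = local_A * local_B
inline dense_matrix operator*(const dense_matrix& A, const dense_matrix& B)
{
    const unsigned n = A.rows();
    dense_matrix C(n, n);
    for (unsigned i = 0; i < n; ++i)
        for (unsigned j = 0; j < n; ++j)
            for (unsigned k = 0; k < n; ++k)
                C(i, j) += A(i, k) * B(k, j);
    return C;
}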

The solution to the exercise is the following:

// rank 0 sends the blocks to the other ranks (which compute the local
// block products), then receives the partial results and prints the
// global vector
if (rank == 0)
{
    // send data
    for (unsigned j = 0; j < N_blocks; ++j) 
    {
        int dest = j / local_N_blocks + 1;
        // send number of rows
        unsigned n = A[j].rows();
        MPI_Send(&n, 1, MPI_UNSIGNED, dest, 1, MPI_COMM_WORLD);
        // send blocks
        MPI_Send(A[j].data(), n*n, MPI_DOUBLE, dest, 2, MPI_COMM_WORLD);
        MPI_Send(B[j].data(), n*n, MPI_DOUBLE, dest, 3, MPI_COMM_WORLD);
    }

    // global vector
    std::vector<dense_matrix> C(N_blocks);

    for (unsigned j = 0; j < N_blocks; ++j) 
    {
        int root = j / local_N_blocks + 1;
        // receive number of rows
        unsigned n;
        MPI_Recv(&n, 1, MPI_UNSIGNED, root, 4, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        // initialize blocks
        dense_matrix received(n,n);
        // receive blocks
        MPI_Recv(received.data(), n*n, MPI_DOUBLE, root, 5,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // store block in the vector
        C[j] = received;
    }
    // print result
    print_matrix(C);
}
// all the other ranks receive the blocks and compute the local block
// products, then send the results to rank 0
else
{
    // local vector
    std::vector<dense_matrix> local_C(local_N_blocks);
    // receive data and compute products
    for (unsigned j = 0; j < local_N_blocks; ++j)
    {
        // receive number of rows
        unsigned n;
        MPI_Recv(&n, 1, MPI_UNSIGNED, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // initialize blocks
        dense_matrix local_A(n,n); dense_matrix local_B(n,n);
        // receive blocks
        MPI_Recv(local_A.data(), n*n, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(local_B.data(), n*n, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // compute product
        local_C[j] = local_A * local_B;
    }
    // send local results
    for (unsigned j = 0; j < local_N_blocks; ++j)
    {
        // send number of rows
        unsigned n = local_C[j].rows();
        MPI_Send(&n, 1, MPI_UNSIGNED, 0, 4, MPI_COMM_WORLD);
        // send block
        MPI_Send(local_C[j].data(), n*n, MPI_DOUBLE, 0, 5, MPI_COMM_WORLD);
    }
}

In my opinion, if local_N_blocks = N_blocks / (size - 1) is different from 1, the variable dest does not change its value at every loop iteration. So, after the first iteration of the "sending loop", the second time that rank 0 reaches

MPI_Send(A[j].data(), n*n, MPI_DOUBLE, dest, 2, MPI_COMM_WORLD);
MPI_Send(B[j].data(), n*n, MPI_DOUBLE, dest, 3, MPI_COMM_WORLD);

it has to wait until the operation local_C[j] = local_A * local_B for the previous j has completed, so the code does not seem well parallelized to me.
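To make the concern concrete, here is a tiny stand-alone example with hypothetical values N_blocks = 4 and size = 3 (so local_N_blocks = 2) that just prints the destination rank computed by the same formula dest = j / local_N_blocks + 1:

// Stand-alone sketch (hypothetical sizes): shows that consecutive block
// indices map to the same destination rank when local_N_blocks > 1.
#include <iostream>

int main()
{
    const int      size           = 3;                      // assumed number of MPI ranks
    const unsigned N_blocks       = 4;                      // assumed number of blocks
    const unsigned local_N_blocks = N_blocks / (size - 1);  // = 2

    for (unsigned j = 0; j < N_blocks; ++j)
    {
        int dest = j / local_N_blocks + 1;                  // same formula as above
        std::cout << "block " << j << " -> rank " << dest << '\n';
    }
    // output: blocks 0 and 1 go to rank 1, blocks 2 and 3 go to rank 2,
    // so rank 0 sends both of rank 1's blocks before sending anything to rank 2.
    return 0;
}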
What do you think?
