How to define a user-defined function in MPI that needs multiple input buffers

Posted on 2025-01-19 18:46:06

I need to define a user-defined reduction in MPI. On each processor I have 3 vectors: one of doubles and two of integers. I can't flatten these vectors into one-dimensional data and pass them through my user-defined function. I also can't use MPI_Type_create_struct with a user-defined datatype, because the sizes of these vectors vary across processors. I know that, as a basic example, a user-defined function should look like this:

void my_sum_function(void* inputBuffer, void* outputBuffer, int* len, MPI_Datatype* datatype)
{
    int* input  = (int*)inputBuffer;
    int* output = (int*)outputBuffer;
    for (int i = 0; i < *len; i++) {
        output[i] += input[i];
    }
}
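For reference, this is how I understand such a single-buffer operation is registered and used end to end (a minimal sketch of my own with plain int buffers, not my real data):

#include <mpi.h>
#include <cstdio>

// Element-wise integer sum, the same shape of callback as above,
// repeated here only so this sketch compiles on its own.
void my_sum_function(void* inputBuffer, void* outputBuffer, int* len, MPI_Datatype* datatype)
{
    int* input  = (int*)inputBuffer;
    int* output = (int*)outputBuffer;
    for (int i = 0; i < *len; i++) {
        output[i] += input[i];
    }
}

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op op;
    MPI_Op_create(&my_sum_function, 1, &op);   // 1 = the operation is commutative

    int local[3]  = { rank, rank + 1, rank + 2 };
    int global[3] = { 0, 0, 0 };
    MPI_Reduce(local, global, 3, MPI_INT, op, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d %d %d\n", global[0], global[1], global[2]);

    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}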

But I'm looking for a way for my user-defined function to take several input buffers. I wonder whether this is possible, and if so, how? If I could use a struct, it would be something like this:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <vector>
#include <iostream>

typedef std::vector<int>  VecInt_t;
typedef std::vector<double>  VecDbl_t;
typedef std::vector<VecInt_t>  VecVecInt_t;


struct vecs
{

    VecDbl_t val_;
    VecInt_t L2G_;
    VecInt_t G2L_;

};

void my_sum_function(void* inputBuffer, void* outputBuffer, int* len, MPI_Datatype* datatype)
{
    // Intended behaviour: scatter-add each local value into its global slot,
    // using the L2G_/G2L_ index maps carried alongside the values.
    vecs* input = (vecs*)inputBuffer;
    double* output = (double*)outputBuffer;
    for (size_t i = 0; i < input->L2G_.size(); i++) {
        output[input->L2G_[i]] += input->val_[input->G2L_[input->L2G_[i]]];
    }
}


int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int root_rank = 0;

    MPI_Op operation;
    MPI_Op_create(&my_sum_function, 1, &operation);

    // This is the part I don't think can work: the vector sizes differ per
    // processor, and the displacements below point at the std::vector objects
    // themselves rather than at their heap-allocated data.
    MPI_Datatype mytype;
    vecs p;
    MPI_Datatype types[3] = { MPI_DOUBLE, MPI_INT, MPI_INT };
    int lengths[3] = { (int)p.val_.size(), (int)p.L2G_.size(), (int)p.G2L_.size() };
    MPI_Aint displacements[3] = { (MPI_Aint)&p.val_, (MPI_Aint)&p.L2G_, (MPI_Aint)&p.G2L_ };
    for (int i = 1; i < 3; i++) displacements[i] -= displacements[0];
    displacements[0] = 0;
    MPI_Type_create_struct(3, lengths, displacements, types, &mytype);
    MPI_Type_commit(&mytype);

    vecs buffer;

    if (rank == 0)
    {
        buffer.val_ = { 3, 2, 5 };
        buffer.L2G_ = { 0, 1, 2 };
        buffer.G2L_ = { 0, 1, 2, -1 };
    }
    else
    {
        buffer.val_ = { 4, 3, 5 };
        buffer.L2G_ = { 0, 2, 3 };
        buffer.G2L_ = { 0, -1, 1, 2 };
    }

    double reduction_results[4] = { 0, 0, 0, 0 };
    MPI_Reduce(&buffer, reduction_results, 4, mytype, operation, root_rank, MPI_COMM_WORLD);


    if (rank == root_rank)
    {

        printf("The sum of first elements of data is %g.\n", reduction_results[0]);
        printf("The sum of second elements of data is %g.\n", reduction_results[1]);
        printf("The sum of third elements of data is %g.\n", reduction_results[2]);
        printf("The sum of fourth elements of data is %g.\n", reduction_results[3]);
    }

    MPI_Type_free(&mytype);
    MPI_Op_free(&operation);
    MPI_Finalize();
    return EXIT_SUCCESS;
}

I'll explain what I'm doing with a simple case. I have 2 triangular elements with 4 nodes in total, and I construct the stiffness matrix for each element on its own processor, so the global stiffness matrix is distributed.
The connectivity is as follows:
Element 0: nodes 0,1,2 ------> goes to processor 0
Element 1: nodes 0,2,3 ------> goes to processor 1
As you can see, nodes 0 and 2 are shared between the processors. In this case my local stiffness matrix is 3 by 3 instead of 4 by 4. Instead of storing the whole global vector, I build a vector containing only the nodes present on the current processor. For example, the stiffness matrix on processor 0 is multiplied by a vector of size 3 holding the local result for nodes 0,1,2. So I define local2global = {0,1,2}; this vector says which global node is present on the current processor. I also define an auxiliary vector global2local of size 4, initialized with -1, giving the local index of each global node; for processor 0 it is 0,1,2,-1.
For processor 1 the stiffness matrix is multiplied by a vector of size 3 containing the local result for nodes 0, 2 and 3. Its local2global is 0,2,3 and its global2local is 0,-1,1,2. Each processor therefore ends up with a result vector of size 3. Let's say
Processor 0: {3 = value on node 0, 2 = value on node 1, 5 = value on node 2}.
Processor 1: {4 = value on node 0, 3 = value on node 2, 5 = value on node 3}.
Now I need to sum the results into a global result vector on processor 0. The nodes that are not shared between processors go directly to their position in the global result vector, but the shared ones must be summed and then divided by the number of processors they are repeated on. Processor 0 already has this repetition vector, so at the end I compute global_result[i] / MPI_reps[i].
In this case the global vector will be {3.5, 2, 4, 5}.
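
To make the bookkeeping concrete, here is a small serial sketch (my own illustration using the numbers above, no MPI involved) of the accumulate-then-divide step that processor 0 has to perform:

#include <cstdio>
#include <vector>

int main()
{
    // The per-processor data from the example above, written out serially.
    std::vector<std::vector<double>> local_result = { { 3, 2, 5 }, { 4, 3, 5 } };
    std::vector<std::vector<int>>    local2global = { { 0, 1, 2 }, { 0, 2, 3 } };

    std::vector<double> global_result(4, 0.0);
    std::vector<int>    reps(4, 0);   // how many processors contribute to each global node

    for (size_t p = 0; p < local_result.size(); ++p) {
        for (size_t i = 0; i < local2global[p].size(); ++i) {
            global_result[local2global[p][i]] += local_result[p][i];
            reps[local2global[p][i]] += 1;
        }
    }
    for (int g = 0; g < 4; ++g) {
        global_result[g] /= reps[g];
        printf("%g ", global_result[g]);   // prints 3.5 2 4 5, matching the expected global vector
    }
    printf("\n");
    return 0;
}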

If I can write my user-defined operation in such a way that

for (int i = 0; i < local2global.size(); i++) {
    global_result[i] += local_result[global2local[local2global[i]]];
}

then I can collect my result.
Now I have a problem defining this user-defined operation. To execute the code above, I need local_result, global2local and local2global to all be passed into the user-defined function through the input buffer, and the user-defined function only has the signature (void* inputBuffer, void* outputBuffer, int* len, MPI_Datatype* datatype). There are two problems for me here. First, I can't pack these vectors into a single one-dimensional array or vector, because they have different types. Second, I can't use int MPI_Type_create_struct(int block_count, const int block_lengths[], const MPI_Aint displacements[], MPI_Datatype block_types[], MPI_Datatype* new_datatype); because the block lengths are not constant across processors. I hope this explains my question more clearly now.


Comments (1)

屋顶上的小猫咪 2025-01-26 18:46:06

The line you give with local/global/local translation only works that way in shared memory. In distributed memory you have to set up index translation data structures, and then use gather operations to get the values. That's no fun, but that's the unfortunate truth about finite elements in distributed memory. There are packages that do this for you, btw.
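
As a rough illustration of what I mean (not a drop-in solution; the NGLOBAL constant and buffer names are just for this sketch), each rank can gather its value array and its local2global indices to the root with MPI_Gatherv, and the root then does the scatter-add and the division by the repetition count:

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int NGLOBAL = 4;                      // total number of global nodes (illustrative)
    std::vector<double> local_val;              // local result on this rank
    std::vector<int>    local2global;           // local index -> global node id
    // Example data for a 2-rank run, as in the question.
    if (rank == 0) { local_val = { 3, 2, 5 }; local2global = { 0, 1, 2 }; }
    else           { local_val = { 4, 3, 5 }; local2global = { 0, 2, 3 }; }

    // 1) Tell the root how many entries each rank contributes.
    int nlocal = (int)local_val.size();
    std::vector<int> counts(size), displs(size);
    MPI_Gather(&nlocal, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

    std::vector<int>    all_idx;
    std::vector<double> all_val;
    if (rank == 0) {
        int total = 0;
        for (int r = 0; r < size; ++r) { displs[r] = total; total += counts[r]; }
        all_idx.resize(total);
        all_val.resize(total);
    }

    // 2) Variable-length gathers: this is what a fixed struct datatype cannot express.
    MPI_Gatherv(local2global.data(), nlocal, MPI_INT,
                all_idx.data(), counts.data(), displs.data(), MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Gatherv(local_val.data(), nlocal, MPI_DOUBLE,
                all_val.data(), counts.data(), displs.data(), MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // 3) Root scatter-adds into the global vector and divides by the repetition count.
    if (rank == 0) {
        std::vector<double> global_result(NGLOBAL, 0.0);
        std::vector<int>    reps(NGLOBAL, 0);
        for (size_t k = 0; k < all_idx.size(); ++k) {
            global_result[all_idx[k]] += all_val[k];
            reps[all_idx[k]] += 1;
        }
        for (int g = 0; g < NGLOBAL; ++g) {
            global_result[g] /= reps[g];
            printf("node %d: %g\n", g, global_result[g]);  // prints 3.5 2 4 5 for the example
        }
    }

    MPI_Finalize();
    return 0;
}

MPI_Gatherv is what handles the fact that every rank contributes a different number of entries, which is exactly the part a fixed-size struct datatype and a user-defined MPI_Op cannot express.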
