MPI_Allgather produces inconsistent results

Posted on 2024-12-20 07:21:44

I am having trouble with MPI_Allgather in part of a much larger piece of software.

The following function is passed a double and a related flag, both of which differ on each node. It is supposed to find the globally minimum double and set every node's values to the dt and flag from the node that holds that minimum.

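#include <stdlib.h>
#include <mpi.h>

/* Replace *dt and *flag on every rank with the values from the rank
 * that holds the globally smallest dt. */
void set_dt_to_global_min (double *dt, int *flag) {
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Nothing to compare against when running on a single rank. */
    if (size == 1)
        return;

    typedef struct DT_FLAG_ {
        double dt;
        int flag;
    } DT_FLAG;

    DT_FLAG local;
    DT_FLAG *gathered = (DT_FLAG *) malloc(size * sizeof(*gathered));

    local.dt = *dt;
    local.flag = *flag;

    /* Each rank contributes one struct, sent as raw bytes;
     * gathered[i] should end up holding the struct from rank i. */
    MPI_Allgather(&local, sizeof(DT_FLAG), MPI_BYTE, gathered, sizeof(DT_FLAG), MPI_BYTE, MPI_COMM_WORLD);

    /* Linear scan for the index of the smallest dt. */
    int i, imin;
    for (imin = 0, i = 1; i < size; ++i) {
        if (gathered[imin].dt > gathered[i].dt) {
            imin = i;
        }
    }

    *dt = gathered[imin].dt;
    *flag = gathered[imin].flag;

    free(gathered);
}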

I am currently running this on 6 nodes, and the following error occurs only on node 5 (which has the smallest value of dt):

  • the true value of gathered[0] is replaced by gathered[2]
  • the true value of gathered[1] is replaced by gathered[3]
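
One way to confirm that slots really are being swapped during the gather (rather than the ranks holding different values than expected) might be to send each rank's own index along with its data and check the order afterwards. A minimal sketch of that check, assuming a hypothetical extended struct `DT_FLAG_RANK` and a hypothetical helper `first_misplaced`; neither appears in the original code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extension of DT_FLAG: each rank also stores its own rank
 * (as obtained from MPI_Comm_rank) before the gather. */
typedef struct {
    double dt;
    int flag;
    int rank;
} DT_FLAG_RANK;

/* After an allgather, slot i should hold the record sent by rank i.
 * Returns the first slot whose stored rank disagrees with its index,
 * or -1 if every slot is in order. */
int first_misplaced(const DT_FLAG_RANK *gathered, int size) {
    assert(gathered != NULL && size > 0);
    for (int i = 0; i < size; ++i) {
        if (gathered[i].rank != i) {
            return i;
        }
    }
    return -1;
}
```

Calling this on every rank right after the gather and printing any result other than -1 would show whether the records arrive out of place or whether the problem lies elsewhere.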

I thought this might have something to do with MPI_COMM_WORLD, since there is potentially a call to MPI_Comm_split() elsewhere; however, I do not yet understand that part of the code.

Does anyone have any ideas?

-- EDIT: Updated the question to reflect that we actually need to hold on to a flag that is associated with dt. This means @suszterpatt's suggestion is great for my initial question, but I don't think it will work for this.
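
For carrying a flag along with the minimum, MPI's built-in `MPI_MINLOC` reduction over `MPI_DOUBLE_INT` pairs keeps the integer attached to the smallest double and, on ties, keeps the smaller integer. The sketch below shows that same selection rule in plain C so it can be checked without an MPI runtime. The names `DT_FLAG`, `min_pair`, and `fold_min` are chosen here for illustration, and this is only one possible approach, not something taken from the thread:

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the struct in the question. */
typedef struct {
    double dt;
    int flag;
} DT_FLAG;

/* Selection rule: the smaller dt wins; on equal dt the smaller flag wins
 * (the same tie-break MPI_MINLOC applies to the integer member). */
DT_FLAG min_pair(DT_FLAG a, DT_FLAG b) {
    if (a.dt < b.dt) return a;
    if (b.dt < a.dt) return b;
    return (a.flag <= b.flag) ? a : b;
}

/* Fold the rule over one record per rank. */
DT_FLAG fold_min(const DT_FLAG *records, int size) {
    assert(records != NULL && size > 0);
    DT_FLAG best = records[0];
    for (int i = 1; i < size; ++i) {
        best = min_pair(best, records[i]);
    }
    return best;
}
```

With MPI, the same result should come from a single `MPI_Allreduce` call using `MPI_DOUBLE_INT` and `MPI_MINLOC`, which avoids the gather buffer altogether.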

Comments (1)

亚希 2024-12-27 07:21:44

A recent update to mpi-default-dev seems to have fixed the problem -- I'll post more details when I can work out what change might have fixed it.
