MPI_Allgather produces inconsistent results
I am having trouble with MPI_Allgather in part of a much larger piece of software.
The following function is passed a double and a related flag, both of which differ on each node. The function is supposed to find the global minimum of the double and set the values on all nodes to the pair that belongs to that minimum.
#include <mpi.h>
#include <stdlib.h>

void set_dt_to_global_min (double *dt, int *flag) {
    int ierr, size;
    ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size == 1)
        return;

    /* One (dt, flag) pair per rank. */
    typedef struct DT_FLAG_ {
        double dt;
        int flag;
    } DT_FLAG;

    DT_FLAG local;
    DT_FLAG *gathered = (DT_FLAG *) malloc(size * sizeof(*gathered));
    local.dt = *dt;
    local.flag = *flag;

    /* Send the struct as raw bytes; every rank receives all pairs. */
    MPI_Allgather(&local, sizeof(DT_FLAG), MPI_BYTE, gathered, sizeof(DT_FLAG), MPI_BYTE, MPI_COMM_WORLD);

    /* Find the entry with the smallest dt. */
    int i, imin;
    for (imin = 0, i = 1; i < size; ++i) {
        if (gathered[imin].dt > gathered[i].dt) {
            imin = i;
        }
    }

    *dt = gathered[imin].dt;
    *flag = gathered[imin].flag;
    free(gathered);
}
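One way to narrow this down is to take the selection loop out of the MPI path and test it on its own with known data. The helper below is a minimal sketch, not part of the original code: the name index_of_min_dt is hypothetical, and DT_FLAG is repeated at file scope only so the block stands alone. If it picks the right entry for hand-built input, the bad values are arriving from the gather rather than being produced by the scan.

```c
#include <stddef.h>

/* Same layout as the struct in the question, moved to file scope. */
typedef struct DT_FLAG_ {
    double dt;
    int flag;
} DT_FLAG;

/* Hypothetical helper: return the index of the entry with the smallest dt.
 * Ties keep the lowest index, matching the strict '>' in the original loop.
 * Returns -1 if there is nothing to scan. */
int index_of_min_dt(const DT_FLAG *gathered, int size) {
    if (gathered == NULL || size <= 0) {
        return -1;
    }
    int imin = 0;
    for (int i = 1; i < size; ++i) {
        if (gathered[imin].dt > gathered[i].dt) {
            imin = i;
        }
    }
    return imin;
}
```

Inside set_dt_to_global_min the loop would then reduce to a single call on the gathered array, and the same function can be exercised without starting MPI at all.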
I am running this on 6 nodes currently and I find the following error occurs only on Node 5 (which has the smallest value of dt):
- the true value of gathered[0] is replaced by gathered[2]
- the true value of gathered[1] is replaced by gathered[3]
I thought that perhaps this has something to do with MPI_COMM_WORLD, as there is potentially a call to MPI_Comm_split(); however, I do not yet understand that part of the code.
Does anyone have any ideas?
-- EDIT: Updated the question to reflect that we actually need to hold on to a flag that is also associated with dt. This means @suszterpatt's suggestion is great for my initial question, but I don't think it will work for this.
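On the edited requirement: a reduction does not have to drop the flag if each rank contributes the pair (dt, flag) and the reduction keeps the pair with the smaller dt. MPI's predefined MPI_MINLOC operation, used with the MPI_DOUBLE_INT pair type in MPI_Allreduce, follows this rule for a value and an int; putting the flag in the int slot is an assumption worth checking against your MPI documentation. The sketch below is plain C with no MPI calls, and minloc_combine, minloc_reduce and DtFlagPair are hypothetical names that only illustrate the pairwise rule.

```c
/* Value/int pair, laid out like the question's DT_FLAG struct. */
typedef struct {
    double dt;
    int flag;
} DtFlagPair;

/* Hypothetical pairwise rule: keep the pair with the smaller dt;
 * on equal dt keep the pair with the smaller flag. */
DtFlagPair minloc_combine(DtFlagPair a, DtFlagPair b) {
    if (a.dt < b.dt) return a;
    if (b.dt < a.dt) return b;
    return (a.flag <= b.flag) ? a : b;
}

/* Fold the rule over one pair per rank; count must be at least 1. */
DtFlagPair minloc_reduce(const DtFlagPair *pairs, int count) {
    DtFlagPair best = pairs[0];
    for (int i = 1; i < count; ++i) {
        best = minloc_combine(best, pairs[i]);
    }
    return best;
}
```

The tie rule differs from the loop in the question: there the lowest rank wins when two dt values are equal, whereas here the smaller flag wins.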
Comments (1)
A recent update to mpi-default-dev seems to have fixed the problem. I'll post more details when I can work out what change might have fixed it.