
Testing MPI_Barrier in C++

Posted 2024-08-17 15:19:13 | Length: 45 characters | Views: 9 | Comments: 0

How can I make sure that MPI_Barrier is working correctly? Is there a way to test it?
Thanks.


梦里人 2024-08-24 15:19:13

I think that to be sure that the MPI_Barrier is working correctly you have to write a program which is guaranteed to behave differently for working and non-working barriers.

I don't think that @Neeraj's answer is guaranteed to behave that way. If the barrier is working correctly, the processes will all write their first output lines before any of them writes a second output line. However, it is possible that this will happen even in the absence of the barrier (or where the barrier has failed completely, if you want to think of it that way). My assertion does not depend on the very short sleep times he suggests (5 ms * rank). Even if you suppose that the processes wait 5 s * rank, it is possible that the statements would appear in the barrier-imposed order in the absence of the barrier. Unlikely, I grant you, but not impossible, especially when you have to consider how the o/s buffers multiple writes to stdout: you might actually be testing that buffering, not the barrier. "Oh," you cry, "even the most inaccurate computer clock will result in process 1 waiting enough less time than process 2 to show the correct working of the barrier." Not if the o/s pre-empts processor 1 (on which process 1 is trying to run) for 10 s, it won't.

Dependence on the on-board clocks for synchronisation actually makes the program less deterministic. All the processors have their own clocks, and the hardware doesn't make any guarantees that they all tick at exactly the same rate or with exactly the same tick length.

Nor does that test adequately explore all the failure modes of the barrier. At best it only explores complete failure. What if the implementation is actually a leaky barrier, so that occasionally a process gets through before the last process has reached the barrier? Off-by-one errors are incredibly common in programs. Or perhaps the barrier code was written 3 years ago and only has enough memory to record the arrival of, say, 2^12 = 4096 processes, and you've put it on a brand new machine with 2^18 processors; the barrier is more of a weir than a dam.

I hadn't thought about this deeply until now. I've never suspected that any of the MPI implementations I've used had faulty barriers, so I don't have a good suggestion about how to thoroughly test a barrier. I'd be inclined to use a parallel debugger and examine the execution of the program through the barrier, but that's not going to provide a guarantee of correct behaviour.

It's an interesting question though.
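
As a partial illustration of the point about leaky barriers, here is a minimal sketch that judges a barrier by an invariant instead of by output order: every participant records its arrival before entering the barrier, and after leaving it checks that all arrivals are already recorded. The sketch uses threads, not MPI ranks, so that the arrival counter can be shared directly; the names `OneShotBarrier`, `LeakyBarrier` and `count_violations` are hypothetical and not part of MPI or of any answer here. An MPI version would need some other shared record of arrivals, which is not shown. A non-zero result exposes a broken barrier, but a zero result proves nothing.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical single-use barrier: wait() blocks until n threads have called it.
class OneShotBarrier {
public:
    explicit OneShotBarrier(int n) : threshold_(n) { assert(n > 0); }
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        if (++arrived_ >= threshold_) {
            cv_.notify_all();  // last arrival releases everyone
        } else {
            cv_.wait(lock, [this] { return arrived_ >= threshold_; });
        }
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int threshold_;
    int arrived_ = 0;
};

// Deliberately broken (assumption for illustration): opens after n-1
// arrivals, mimicking the off-by-one failure described above.
class LeakyBarrier {
public:
    explicit LeakyBarrier(int n) : inner_(n > 1 ? n - 1 : 1) {}
    void wait() { inner_.wait(); }
private:
    OneShotBarrier inner_;
};

// Runs `rounds` independent trials with a fresh Barrier each time and
// returns how often a thread left the barrier before all had arrived.
template <typename Barrier>
int count_violations(int n_threads, int rounds) {
    int total = 0;
    for (int r = 0; r < rounds; ++r) {
        Barrier barrier(n_threads);
        std::atomic<int> arrivals{0};
        std::atomic<int> bad{0};
        std::vector<std::thread> threads;
        for (int i = 0; i < n_threads; ++i) {
            threads.emplace_back([&, i] {
                // One straggler makes a leak more likely to show up;
                // the verdict itself does not depend on timing.
                if (i == 0)
                    std::this_thread::sleep_for(std::chrono::milliseconds(1));
                arrivals.fetch_add(1);   // record arrival
                barrier.wait();
                if (arrivals.load() != n_threads) bad.fetch_add(1);
            });
        }
        for (auto& t : threads) t.join();
        total += bad.load();
    }
    return total;
}
```

Called from `main`, `count_violations<OneShotBarrier>(4, 50)` should return 0 on every run, while `count_violations<LeakyBarrier>(4, 50)` will usually, though not necessarily, return a positive count. That asymmetry is exactly the limitation discussed above: the check can fail a bad barrier but cannot certify a good one.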


风吹雪碎 2024-08-24 15:19:13


#include <mpi.h>
#include <unistd.h>   /* sleep() */
#include <iostream>   /* std::cout, std::endl */

int main(int argc, char *argv[])
{
  int rank;

  MPI_Init(&argc, &argv);                  /* starts MPI */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* get current process id */

  sleep(5 * rank); // make each process wait a different number of seconds
  std::cout << "Synchronization point for:" << rank << std::endl;
  MPI_Barrier(MPI_COMM_WORLD);
  std::cout << "After Synchronization, id:" << rank << std::endl;

  MPI_Finalize();
  return 0;
}



全部不再 2024-08-24 15:19:13

Allen Downey says this in his book The Little Book of Semaphores (about the reusable barrier algorithm he presents):

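Unfortunately, this solution is typical of most non-trivial synchronization code: it is difficult to be sure that a solution is correct. Often there is a subtle way that a particular path through the program can cause an error. To make matters worse, testing an implementation of a solution is not much help. The error might occur very rarely because the particular path that causes it might require a spectacularly unlucky combination of circumstances. Such errors are almost impossible to reproduce and debug by conventional means. The only alternative is to examine the code carefully and "prove" that it is correct. I put "prove" in quotation marks because I don't mean, necessarily, that you have to write a formal proof (although there are zealots who encourage such lunacy).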
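
For context, here is a minimal sketch of a two-turnstile reusable barrier in the spirit of the one the quote refers to. It is written from memory in C++ and not copied from the book, so details may differ from Downey's version. The `Semaphore` and `ReusableBarrier` classes and the `count_round_violations` helper are hypothetical names introduced for illustration. The helper reuses one barrier over many rounds and counts how often a thread leaves round r before all arrivals for that round are recorded. As the quote warns, a zero count is only weak evidence of correctness.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore built on a mutex and condition variable.
class Semaphore {
public:
    explicit Semaphore(int initial) : count_(initial) {}
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
    void signal() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ++count_;
        }
        cv_.notify_one();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
};

// Two-phase barrier: turnstile 1 gathers all threads, turnstile 2
// stops a fast thread from lapping the others into the next round.
class ReusableBarrier {
public:
    explicit ReusableBarrier(int n) : n_(n), turnstile1_(0), turnstile2_(1) {
        assert(n > 0);
    }
    void wait() {
        {   // Phase 1: the last arrival locks turnstile 2, opens turnstile 1.
            std::lock_guard<std::mutex> lock(m_);
            if (++count_ == n_) {
                turnstile2_.wait();
                turnstile1_.signal();
            }
        }
        turnstile1_.wait();
        turnstile1_.signal();   // let the next thread through
        {   // Phase 2: the last leaver locks turnstile 1, opens turnstile 2.
            std::lock_guard<std::mutex> lock(m_);
            if (--count_ == 0) {
                turnstile1_.wait();
                turnstile2_.signal();
            }
        }
        turnstile2_.wait();
        turnstile2_.signal();
    }
private:
    const int n_;
    int count_ = 0;
    std::mutex m_;
    Semaphore turnstile1_;
    Semaphore turnstile2_;
};

// Reuses one barrier for `rounds` rounds; returns the number of times a
// thread left round r with fewer than (r + 1) * n_threads arrivals recorded.
int count_round_violations(int n_threads, int rounds) {
    ReusableBarrier barrier(n_threads);
    std::atomic<long> arrivals{0};
    std::atomic<int> violations{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([&] {
            for (int r = 0; r < rounds; ++r) {
                arrivals.fetch_add(1);
                barrier.wait();
                if (arrivals.load() < static_cast<long>(r + 1) * n_threads)
                    violations.fetch_add(1);
            }
        });
    }
    for (auto& th : threads) th.join();
    return violations.load();
}
```

A call such as `count_round_violations(4, 500)` should return 0. Removing the second turnstile is an easy way to see the kind of rare, timing-dependent failure the quote describes, since the stress loop may then pass many times before it fails.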