Testing MPI_Barrier in C++
How can I be sure that MPI_Barrier acts correctly? What method can I use to test that?
Thank you
Comments (3)
I think that to be sure that the MPI_Barrier is working correctly you have to write a program which is guaranteed to behave differently for working and non-working barriers.
I don't think that @Neeraj's answer is guaranteed to behave that way. If the barrier is working correctly, the processes will all write their first output lines before any of them writes a second output line. However, it is possible that this will happen even in the absence of the barrier (or where the barrier has failed completely, if you want to think of it this way). My assertion does not depend on the very short sleep times he suggests (5 ms * rank). Even if you suppose that the processes wait (5 s * rank), it is possible that the statements would appear in the barrier-imposed order in the absence of the barrier. Unlikely, I grant you, but not impossible, especially when you have to consider how the o/s buffers multiple writes to stdout; you might actually be testing that process, not the barrier. "Oh," you cry, "even the most inaccurate computer clock will result in process 1 waiting enough less time than process 2 to show the correct working of the barrier." Not if the o/s preemptively grabs processor 1 (on which process 1 is trying to run) for 10 s, it won't.
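To make the ordering argument concrete, here is a minimal sketch of the check such an output-based test boils down to. The test program under discussion is not reproduced in this thread, so the `Event` record and the `barrier_order_respected` function are hypothetical names, and the log is assumed to be the merged output in the order it was observed.

```cpp
#include <vector>

// Hypothetical record of one output line: which rank wrote it, and whether
// it was written before (phase 1) or after (phase 2) the barrier call.
struct Event {
    int rank;
    int phase;
};

// Returns true when every phase-1 line precedes every phase-2 line in the log.
bool barrier_order_respected(const std::vector<Event>& log) {
    bool seen_after = false;
    for (const Event& e : log) {
        if (e.phase == 2) {
            seen_after = true;
        } else if (seen_after) {
            return false;  // a "before" line appeared after some "after" line
        }
    }
    return true;
}
```

A log that fails this check shows that the observed output was not in barrier order. A log that passes shows only that the observed order is consistent with a working barrier; scheduling luck or stdout buffering can produce the same order with no barrier at all.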
Dependence on the on-board clocks for synchronisation actually makes the program less deterministic. All the processors have their own clocks, and the hardware doesn't make any guarantees that they all tick at exactly the same rate or with exactly the same tick length.
Nor does that test adequately explore all the failure modes of the barrier. At best it only explores complete failure; what if the implementation is actually a leaky barrier, so that occasionally a process gets through before the last process has reached the barrier? Off-by-one errors are incredibly common in programs. Or perhaps the barrier code was written 3 years ago and only has enough memory to record the arrival of, say, 2^12 == 4096 processes, and you've put it on a brand new machine with 2^18 processors; the barrier is more of a weir than a dam.
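The leaky case can be illustrated without clocks or output ordering by counting arrivals. The sketch below is a toy model, not a test of MPI_Barrier itself: threads in one process stand in for MPI ranks, and `ToyBarrier`, `count_early_exits`, and the 500 ms hold-back are all assumptions made for illustration. One participant is deliberately held back so that an off-by-one barrier has a chance to let the others through early.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Toy counting barrier. `threshold` is the number of arrivals that releases
// the waiters: threshold == n is correct, threshold == n - 1 models an
// off-by-one "leaky" barrier.
class ToyBarrier {
public:
    explicit ToyBarrier(int threshold) : threshold_(threshold) {}

    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        ++count_;
        if (count_ >= threshold_) {
            cv_.notify_all();
        } else {
            cv_.wait(lock, [this] { return count_ >= threshold_; });
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_ = 0;
    int threshold_;
};

// Sends n participants through the barrier once and returns how many of them
// left the barrier while fewer than n had announced their arrival.
int count_early_exits(int n, int threshold) {
    ToyBarrier barrier(threshold);
    std::atomic<int> arrived{0};
    std::atomic<int> exited{0};
    std::atomic<int> early{0};

    auto participant = [&](bool straggler) {
        if (straggler) {
            // Hold back for up to 500 ms (an arbitrary bound), or until some
            // other participant has already left the barrier.
            const auto deadline = std::chrono::steady_clock::now() +
                                  std::chrono::milliseconds(500);
            while (exited.load() == 0 &&
                   std::chrono::steady_clock::now() < deadline) {
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        }
        arrived.fetch_add(1);  // announce arrival before entering the barrier
        barrier.wait();
        if (arrived.load() < n) {
            early.fetch_add(1);  // got through before everyone had arrived
        }
        exited.fetch_add(1);
    };

    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back(participant, i == n - 1);
    }
    for (auto& t : workers) {
        t.join();
    }
    return early.load();
}
```

Calling `count_early_exits(4, 4)` models a correct barrier and should report no early exits, while `count_early_exits(4, 3)` models the off-by-one case and should report at least one. Remove the deliberate hold-back and the leaky barrier can pass unnoticed, which is the weakness described above.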
I haven't thought about this deeply until now, and I've never suspected that any of the MPI implementations I've used had faulty barriers, so I don't have a good suggestion about how to thoroughly test a barrier. I'd be inclined to use a parallel debugger and examine the execution of the program through the barrier, but that's not going to provide a guarantee of correct behaviour.
It's an interesting question though.
Allen Downey in his book The Little Book of Semaphores says this (about a reusable barrier algorithm he presents):