Keep retrying an HPC task until resources are available (Windows HPC Server 2008 R2 SP3)
An HPC task either succeeds or fails, but how do I communicate "try later" back to the scheduler? Sure, I could fail and resubmit the task, but I need a way to determine whether I've failed because something is broken (give up) or whether I should try again shortly because this task is waiting on another task (and keep trying until we either error out or complete successfully).
Is there a way to achieve this using the HPC API or similar? From what I've heard, anything non-zero is failure and zero is success, and that's it. Surely there must be a nice way to achieve this "try later" behaviour.
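To make the "give up" versus "try later" distinction concrete, here is a minimal sketch of a wrapper that could be submitted as the task's command line. It is not part of the HPC API. It assumes a convention of my own: the real task exits with a reserved code (`RETRY_LATER`, arbitrarily 75 here, which the scheduler knows nothing about) when it is only waiting on another task. The function name and parameters are hypothetical.

```python
import subprocess
import time

# Hypothetical convention (not an HPC scheduler feature): the task exits
# with this code to say "nothing is broken, I am just not ready yet".
RETRY_LATER = 75


def run_until_done(cmd, delay_seconds=30.0, max_attempts=None):
    """Run cmd again and again while it exits with RETRY_LATER.

    Returns 0 on success, the task's own non-zero code on a real failure,
    or RETRY_LATER if max_attempts runs out first.
    """
    attempt = 0
    while True:
        attempt += 1
        code = subprocess.call(cmd)
        if code != RETRY_LATER:
            # Either success (0) or a genuine failure: report it as-is.
            return code
        if max_attempts is not None and attempt >= max_attempts:
            return RETRY_LATER
        # Not ready yet: wait before trying again.
        time.sleep(delay_seconds)
```

The scheduler would then only ever see the wrapper's final exit code: zero once the task eventually succeeds, non-zero if it hit a real error. The trade-off is that the wrapper keeps occupying its core while it sleeps.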
Background
We are attempting to run a number of HPC tasks in a single job that have complex interdependencies between them: while the first task is executing, the other tasks sit and wait until the first task has processed enough of the data for them to make a start (a sort of cascading execution, but not in any easy order, so we can't define the dependencies in HPC).
Initially I was trying to get these multiple tasks shared across multiple cores in such a way that they could sleep while waiting for the main task to complete the part they are interested in, similar to how Windows would timeshare processes. It's now clear that HPC (by design!) only allows one task per core, so if you have an eight-core machine you can only run eight tasks at once.
The solution appears to be to use a batch file or similar to spawn multiple processes; however, before I go down that path I'd like to know if the above question is feasible.
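For reference, the "spawn multiple processes" fallback might look roughly like this if written as a small launcher script rather than a batch file. This is only a sketch under the assumption that one HPC task runs the launcher and the launcher starts the real workers as child processes. `run_all` is a hypothetical helper, not anything provided by HPC.

```python
import subprocess


def run_all(commands):
    """Start every command at once, wait for all of them to finish,
    and return the first non-zero exit code (or 0 if all succeeded)."""
    # Launch all children up front so they run concurrently.
    procs = [subprocess.Popen(cmd) for cmd in commands]
    # Block until each child has exited and collect its exit code.
    codes = [p.wait() for p in procs]
    return next((c for c in codes if c != 0), 0)
```

Because the children are ordinary operating-system processes, those that are waiting on the main worker can sleep without each needing a core from the scheduler, and the launcher's single exit code is all the scheduler has to interpret.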