Retrying an HPC task until resources are available (Windows HPC Server 2008 R2 SP3)

Posted on 2024-11-26 13:44:30


An HPC task either succeeds or fails, but how do I communicate "try again later" back to the scheduler? Sure, I could fail and resubmit the task, but I need a way to determine whether I failed because something is broken (give up) or should retry shortly because this task is waiting on another task (and keep trying until we either error out or complete successfully).

Is there a way to achieve this using the HPC API or similar? From what I've heard, anything non-zero is failure and zero is success, and that's it. Surely there must be a nice way to achieve this "try later" behaviour.
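Absent scheduler support, one illustrative workaround (an assumption on my part, not an HPC API feature) is to wrap the real executable in a small launcher that reserves one exit code to mean "try later" and retries in place, so the scheduler only ever sees the final verdict. A minimal Python sketch:

```python
import subprocess
import sys
import time

# Hypothetical convention (not part of the HPC API): the task exits 0 on
# success, 75 to mean "dependency not ready, try again later", and any
# other code is a genuine failure.
RETRY_EXIT_CODE = 75

def run_with_retry(cmd, max_attempts=5, delay_seconds=1.0):
    """Re-run cmd while it reports 'try later'; give up on real failures."""
    for _ in range(max_attempts):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return 0                      # success
        if result.returncode != RETRY_EXIT_CODE:
            return result.returncode      # genuine failure: propagate it
        time.sleep(delay_seconds)         # not ready yet: wait and retry
    return RETRY_EXIT_CODE                # still not ready after all attempts
```

The scheduler runs the wrapper rather than the task directly, so only the wrapper's final return value counts as the task's exit code.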

Background

We are attempting to run a number of HPC tasks in a single job that have complex interdependencies between them: as the first task is executing, other tasks sit and wait until the first task has processed enough of the data that they can make a start (a sort of cascading execution, but not in any easy order, so we can't define dependencies in HPC).

Initially I was trying to get these multiple tasks shared across multiple cores in such a way that they could sleep while waiting for the main task to complete the work they are interested in, similar to how Windows timeshares processes. It's now clear that HPC (by design!) only allows one task per core, so if you have an eight-core machine you can only run eight tasks at once.
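For what it's worth, the sleep-while-waiting idea can be sketched as a simple polling loop; here I assume (hypothetically) that the main task signals its progress by creating a marker file:

```python
import os
import time

def wait_for_marker(path, poll_seconds=0.1, timeout_seconds=30.0):
    """Block until the upstream task has created `path`, or time out."""
    deadline = time.monotonic() + timeout_seconds
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(f"gave up waiting for {path}")
        time.sleep(poll_seconds)  # yield the core while we wait
```

Any equivalent signal (a database row, a named event, a queue message) would work the same way; the marker file is just the simplest to illustrate.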

The solution appears to be to use a batch file or similar to spawn multiple processes; however, before I go down that path I'd like to know whether the above is feasible.


Comments (1)

梦太阳 2024-12-03 13:44:30
  1. Unfortunately, there's no way for a task to fail in a way that will cause the scheduler to retry it.
  2. As you suspected, the recommended approach is to have each task run a batch file or PowerShell script that starts all the processes you want started.
  3. If you don't want to do that, the HPC scheduler in SP2 now allows core over-subscription (more than one task per core), which might be applicable to solving your problem. See here for a guide on how to set it up: Oversubscribe core counts on cluster nodes
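Point 2 above can be sketched as a small launcher that a single HPC task runs; the worker commands below are placeholders for your real executables:

```python
import subprocess
import sys

# Placeholder worker commands; in practice these would be your actual
# executables. All of them run concurrently inside one HPC task.
WORKERS = [
    [sys.executable, "-c", "print('worker 1 done')"],
    [sys.executable, "-c", "print('worker 2 done')"],
]

def launch_all(commands):
    """Start every worker at once, wait for all, fail if any failed."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    # 0 only if every worker exited 0; otherwise the worst exit code.
    return max(p.wait() for p in procs)

exit_code = launch_all(WORKERS)
```

The launcher's exit code becomes the task's exit code, so the scheduler still sees a single success/failure for the whole group.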