Slurm action on job termination or failure
I would like the Slurm workload manager to perform some action, such as touch stopped.txt, at job termination, whether due to timeout or failure. How can this be done?

Comments (1)
Once the job has terminated, regular users have no way to perform further actions. (Admins can use strigger or set up Epilog scripts.)
For termination due to timeout, the typical course of action is to set up a Bash "trap" to catch a signal, and to request that Slurm send that signal a few minutes before the job is killed.
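The trap approach could look roughly like the sketch below. It assumes a one-hour time limit, a 300-second warning, SIGUSR1 as the signal, and a `sleep 1` placeholder standing in for the real program; all of these are illustrative choices, not part of the original answer.

```shell
#!/bin/bash
#SBATCH --time=01:00:00          # assumed time limit
#SBATCH --signal=B:USR1@300      # send SIGUSR1 to the batch shell about 300 s before the limit

# Handler run when the warning signal arrives.
on_timeout() {
    touch stopped.txt
}
trap on_timeout USR1

# Placeholder workload (hypothetical); replace with the real program.
# It runs in the background and the script waits on it, because Bash
# only runs a trap handler once the current foreground command returns.
sleep 1 &
wait $!
```

When the signal interrupts `wait`, the handler runs and the script continues past `wait`. Call `wait` again after it if the program should keep running until Slurm kills the job.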
For termination due to failure, you can test the return code of your main program inside the submission script and act accordingly.
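A minimal sketch of the return-code test, assuming a hypothetical `main_program` function that stands in for the real program (here it simply calls `false` to simulate a failure):

```shell
#!/bin/bash

# Placeholder for the real program (hypothetical); `false` simulates a failure.
main_program() {
    false
}

if main_program; then
    echo "main program succeeded"
else
    # $? still holds the exit status of the command tested by `if`.
    status=$?
    echo "main program failed with exit code $status" >&2
    touch stopped.txt
fi
```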
Another option, which could be seen as overkill but is easier to implement, is to submit a "monitoring" job that depends on the job after which some action must be taken, and have that job create the stopped.txt file based on the state of the job in the accounting.
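One way the monitoring job might be written is sketched below. The script names `job.sh` and `monitor.sh`, the variable `WATCHED_JOBID`, and the list of states treated as "stopped" are assumptions made for illustration; adjust them to the site's accounting setup.

```shell
#!/bin/bash
# monitor.sh (hypothetical name). Possible submission, run from a login node:
#   jobid=$(sbatch --parsable job.sh)
#   sbatch --dependency=afterany:"$jobid" --export=ALL,WATCHED_JOBID="$jobid" monitor.sh

# Return 0 if the given Slurm job state counts as an abnormal stop.
is_stopped_state() {
    case "$1" in
        TIMEOUT|FAILED|NODE_FAIL|OUT_OF_MEMORY|CANCELLED*) return 0 ;;
        *) return 1 ;;
    esac
}

# Query the accounting only when a job ID was passed in.
if [ -n "${WATCHED_JOBID:-}" ]; then
    # -X: allocation only, -n: no header, -P: untruncated parsable output.
    state=$(sacct -j "$WATCHED_JOBID" -X -n -P -o State | awk 'NR==1 {print $1}')
    if is_stopped_state "$state"; then
        touch stopped.txt
    fi
fi
```

The `afterany` dependency starts the monitor whatever the outcome of the watched job, so the decision rests entirely on the state reported by `sacct`.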