Troubleshooting SIGTERM when using tee in an SGE job on a cluster
I have some legacy scientific code running on a Rocks cluster, with SGE. I have an application-specific job submission script that generates qsub scripts (i.e. the script which Sun Grid Engine takes and runs).
Within the qsub script, my legacy app is called. This app sends its output to STDOUT. SGE intercepts STDOUT and spools it into a file in the user's home directory, so the user can see results build up in real time. I want this behavior to be maintained, but at the same time, I want to transparently log all output in the background. I figured tee would be perfect to achieve this.
So I modified the job submission script to run the app and pipe STDOUT to tee, which saves STDOUT to a file that is copied to a central store once the job completes. The app is run and piped to tee as follows:
\$GMSCOMMAND | tee \$SCRATCHDIR/gamess_output.log
The problem is, ever since I've started piping the code to tee, the app has been dying with SIGTERMs, especially when I request several nodes. I tried using the -i (ignore interrupts) parameter with tee: it makes no difference.
Things work fine if I redirect the app output to a file and then cat the file once the app is done, but then users can't watch the results build up in real time (which is an important requirement).
Any ideas about why this use of tee might be failing? Or alternatively, any ideas about how else I might achieve the desired functionality?
Comments (2)
I don't know why your particular case is failing, but one option might be to make $GMSCOMMAND do its own logging (effectively putting the tee inside the app). I guess this option depends on the cost of changing the legacy app. Failing that, you could wrap the legacy app with your own script or application to do the redirection/duplication.
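One way to sketch the wrapper idea without a pipe between the app and tee: redirect the app's output to the log file (which the question says works fine) and follow that file with tail in the background so the lines still reach STDOUT while the job runs. This is only an illustration under assumptions: legacy_app is a hypothetical stand-in for $GMSCOMMAND, the log path falls back to /tmp when $SCRATCHDIR is unset, and the one-second pause before stopping tail is a crude guess rather than a guaranteed flush. Whether it avoids the SIGTERM on your cluster is untested.

```shell
#!/bin/bash
# Hypothetical stand-in for the real application; replace with $GMSCOMMAND.
legacy_app() {
    printf 'step 1\nstep 2\n'
}

# Assumed log location, modelled on the path used in the question.
LOGFILE="${SCRATCHDIR:-/tmp}/gamess_output.log"
: > "$LOGFILE"                      # create or truncate the log

# Follow the log from its first line so everything written to it also
# appears on this script's STDOUT (the stream SGE spools).
tail -n +1 -f "$LOGFILE" &
TAIL_PID=$!

# Run the app with a plain redirect: no pipe sits between it and the log.
legacy_app >> "$LOGFILE"
APP_STATUS=$?

sleep 1                             # crude: give tail time to print the last lines
kill "$TAIL_PID"
wait "$TAIL_PID" 2>/dev/null || true
```

A real wrapper would probably finish with exit "$APP_STATUS" so that SGE sees the application's own exit code rather than that of the cleanup commands.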
If pipes are your problem, perhaps you can get around this by using a while/read loop with process substitution. Does this work for you?
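A minimal sketch of what that could look like in bash follows. The names are assumptions for illustration: legacy_app stands in for $GMSCOMMAND, and the log path falls back to /tmp when $SCRATCHDIR is unset. Process substitution is a bash feature, so the qsub script would need to run under bash rather than plain sh. I have not verified that this avoids the SIGTERM.

```shell
#!/bin/bash
# Hypothetical stand-in for the real application; replace with $GMSCOMMAND.
legacy_app() {
    printf 'step 1\nstep 2\nstep 3\n'
}

# Assumed log location, modelled on the path used in the question.
LOGFILE="${SCRATCHDIR:-/tmp}/gamess_output.log"
: > "$LOGFILE"                      # create or truncate the log

# The loop runs in the current shell and reads the app's output line by line
# from the process substitution. The "|| [ -n "$line" ]" part keeps a final
# line that lacks a trailing newline.
while IFS= read -r line || [ -n "$line" ]; do
    printf '%s\n' "$line"               # to STDOUT, which SGE spools for the user
    printf '%s\n' "$line" >> "$LOGFILE" # background copy for the central store
done < <(legacy_app)
```

Note that the loop's exit status is not the application's, so a job that needs the app's exit code would have to capture it some other way.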