SGE - qsub unable to submit jobs in synchronous mode
I have a perl script that prepares files for input to a binary program and submits the execution of the binary program to the SGE queueing system version 6.2u2.
The jobs are submitted with the -sync y option to permit the parent perl script to monitor the status of the submitted jobs with the waitpid function.
This is also very useful because sending a SIGTERM to the parent perl script propagates this signal to each of the children, who then forward this signal onto qsub, thus gracefully terminating all associated submitted jobs.
Thus, it is fairly crucial that I be able to submit jobs with this -sync y option.
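The pattern can be sketched in shell (the Perl original was not posted; the job script names here are hypothetical):

```shell
#!/bin/sh
# Sketch of the submission pattern, assuming hypothetical job scripts.
# Each `qsub -sync y` stays in the foreground until its job finishes,
# so the parent can supervise the submissions with plain `wait`.

qsub -sync y job_a.sh &
pid_a=$!
qsub -sync y job_b.sh &
pid_b=$!

# Forward SIGTERM to the qsub processes so the queued jobs are
# terminated gracefully, mirroring the Perl script's signal handling.
trap 'kill -TERM "$pid_a" "$pid_b"' TERM

wait "$pid_a" "$pid_b"
```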
Unfortunately, I keep getting the following error:
Unable to initialize environment because of error: range_list containes no elements
Notice the improper spelling of 'containes'. That is NOT a typo on my part; the misspelling is in SGE itself. It just shows you how poorly maintained this area of the code/error message must be.
The attempted submissions that produce this error fail to even generate the STDOUT and STDERR files *.o{JOBID} and *.e{JOBID}. The submission just completely fails.
Searching Google for this error message turns up only unresolved posts on obscure message boards.
This error does not even occur reliably. I can rerun my script and the same jobs will not necessarily even generate the error. It also seems not to matter from which node I attempt to submit jobs.
My hope is that someone here can figure this out.
Answers to any of these questions would thus solve my problem:
- Does this error persist in more recent versions of SGE?
- Can I alter my command line options for qsub to avoid this?
- What the hell is this error message talking about?
2 Answers
Our site hit this issue in SGE 6.2u5. I've posted some questions on the mailing list, but there was no solution. Until now.
It turns out that the error message is bogus. I discovered this by reading through the change logs on the Univa github "open-core" repo. I later saw the issue mentioned in the Son Of Gridengine v8.0.0c Release Notes.
Here are the related commits in the github repo:
What the error message should say is that you've hit the limit on the number of qsub -sync y jobs in the system. This parameter is known as MAX_DYN_EC. The default in our version was 99, and the changes above increase that default to 1000.

The definition of MAX_DYN_EC can be found in the sge_conf(5) man page.

You can check how many dynamic event clients exist with the following command:
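The command itself did not survive in the post; a likely candidate, assuming SGE's `qconf -secl` (show event client list) subcommand, is:

```shell
# Each `qsub -sync y` (and each DRMAA session) registers a dynamic
# event client; `qconf -secl` lists the currently registered clients.
qconf -secl

# Rough count of the qsub clients (the NAME column may vary by version):
qconf -secl | grep -c qsub
```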
We have added MAX_DYN_EC=1000 to qmaster_params via qconf -mconf. I've tested submitting hundreds of qsub -sync y jobs and we no longer hit the range_list error. Prior to the MAX_DYN_EC change, doing so would reliably trigger the error.
I found a solution to this problem - or at the very least a workaround.

My goal was to get individual instances of qsub to remain in the foreground for as long as the job it submitted was still in the queue or running. This was achieved with the -sync y option, but it resulted in the horribly unpredictable bug that I describe in my question.

The solution was to use the qrsh command with the -now n option. This causes the job to behave similarly to qsub -sync y, in that my script can implicitly monitor whether a submitted job is running by using waitpid on the qrsh instance.

The only caveat to this solution is that the queue you are operating on must not make any distinction between interactive nodes (served by qrsh) and non-interactive nodes (accessible by qsub). Should a distinction exist (likely there are fewer interactive nodes than non-interactive ones), then this workaround may not help.

However, as I have found nothing even close to a solution to the qsub -sync y problem that is anywhere near as functional as this, let this post go out across the interwebs to any wayward soul caught in a situation similar to mine.
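A minimal sketch of the workaround (the job script name is hypothetical):

```shell
# `-now n` makes qrsh queue the job like a batch job rather than demand
# an immediate slot; qrsh then stays in the foreground until the job ends,
# so the parent can waitpid/wait on it just as with `qsub -sync y`.
qrsh -now n ./job_a.sh &
wait $!
```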