SGE - qsub unable to submit jobs in sync mode

Posted 2024-10-15 11:39:29


I have a perl script that prepares files for input to a binary program and submits the execution of the binary program to the SGE queueing system version 6.2u2.

The jobs are submitted with the -sync y option so that the parent perl script can monitor the status of the submitted jobs with the waitpid function.

This is also very useful because sending a SIGTERM to the parent perl script propagates this signal to each of the children, which then forward it on to qsub, thus gracefully terminating all associated submitted jobs.
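
In outline, the submission logic looks something like this (a simplified sketch; the job script names and the explicit signal handler are placeholders for what my real script does):

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my @job_scripts = ('job_a.sh', 'job_b.sh');   # hypothetical job scripts
    my @pids;

    for my $script (@job_scripts) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: with -sync y, qsub stays in the foreground until the
            # submitted job leaves the queue, so this child process lives
            # exactly as long as the job does.
            exec('qsub', '-sync', 'y', $script)
                or die "exec qsub failed: $!";
        }
        push @pids, $pid;
    }

    # Forward SIGTERM to every child qsub; each qsub then terminates
    # its own submitted job.
    $SIG{TERM} = sub { kill 'TERM', @pids; exit 1 };

    # Block until every qsub (and therefore every job) has finished.
    waitpid($_, 0) for @pids;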

Thus, it is fairly crucial that I be able to submit jobs with this -sync y option.

Unfortunately, I keep getting the following error:

Unable to initialize environment because of error: range_list containes no elements

Notice the misspelling of 'containes'. That is NOT a typo on my part; it just shows how poorly maintained this area of the code and its error messages must be.

The attempted submissions that produce this error fail to even generate the STDOUT and STDERR files *.o{JOBID} and *.e{JOBID}. The submission just completely fails.

Searching Google for this error message only turns up unresolved posts on obscure message boards.

This error does not even occur reliably. I can rerun my script and the same jobs will not necessarily generate the error. It also seems not to matter from which node I attempt to submit jobs.

My hope is that someone here can figure this out.

Answers to any of these questions would thus solve my problem:

  1. Does this error persist in more recent versions of SGE?
  2. Can I alter my command line options for qsub to avoid this?
  3. What the hell is this error message talking about?


Comments (2)

镜花水月 2024-10-22 11:39:29


Our site hit this issue in SGE 6.2u5. I've posted some questions on the mailing list, but there was no solution. Until now.

It turns out that the error message is bogus. I discovered this by reading through the change logs on the Univa github "open-core" repo. I later saw the issue mentioned in the Son Of Gridengine v8.0.0c Release Notes.

The related commits can be found in that github repo's change logs.

What the error message should say is that you've hit the limit on the number of qsub -sync y jobs in the system. This parameter is known as MAX_DYN_EC. The default in our version was 99, and the changes above increase that default to 1000.

The definition of MAX_DYN_EC (from the sge_conf(5) man page) is:

    Sets the max number of dynamic event clients (as used by qsub -sync y
    and by Grid Engine DRMAA API library sessions). The default is set to
    99. The number of dynamic event clients should not be bigger than half
    of the number of file descriptors the system has. The number of file
    descriptors are shared among the connections to all exec hosts, all
    event clients, and file handles that the qmaster needs.

You can check how many dynamic event clients you are using with the following command:

$ qconf -secl | grep qsub | wc -l

We have added MAX_DYN_EC=1000 to qmaster_params via qconf -mconf. I've tested submitting hundreds of qsub -sync y jobs and we no longer hit the range_list error. Prior to the MAX_DYN_EC change, doing so would reliably trigger the error.
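
For anyone else making this change, it amounts to appending MAX_DYN_EC to the qmaster_params line of the global configuration (a sketch; pick a value appropriate for your qmaster host's file-descriptor limit):

    $ qconf -mconf
    # in the editor that opens, extend the qmaster_params line, e.g.:
    #   qmaster_params   MAX_DYN_EC=1000
    $ qconf -sconf | grep qmaster_params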

温柔戏命师 2024-10-22 11:39:29


I found a solution to this problem - or at the very least a workaround.

My goal was to get individual instances of qsub to remain in the foreground while the job each one submitted was still in the queue or running. This was achieved with the -sync option, but it resulted in the horribly unpredictable bug that I describe in my question.

The solution to this problem was to use the qrsh command with the -now n option. This causes the job to behave similarly to qsub -sync y, in that my script can implicitly monitor whether a submitted job is running by using waitpid on the qrsh instance.
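
The pattern is otherwise the same as with qsub -sync y; only the submission command changes. A minimal perl sketch (the binary name and its arguments are placeholders):

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # -now n tells qrsh to queue the job and wait for a slot rather
        # than failing immediately when none is free, so qrsh stays in
        # the foreground for the lifetime of the job.
        exec('qrsh', '-now', 'n', './my_binary', 'input.dat')
            or die "exec qrsh failed: $!";
    }
    # Returns once qrsh exits, i.e. once the job has finished.
    waitpid($pid, 0);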

The only caveat to this solution is that the queue you are operating on must not make any distinction between interactive nodes (offered by qrsh) and non-interactive nodes (accessible by qsub). Should a distinction exist (likely there are fewer interactive nodes than non-interactive), then this workaround may not help.

However, as I have found nothing even close to a solution to the qsub -sync problem that is anywhere near as functional as this, let this post go out across the interwebs to any wayward soul caught in a situation similar to mine.
