Alternative for batch job scheduling (in a compute pool)
Since I don't have root rights on the machines in a compute pool, and thus cannot adapt the load parameters of atd for batch, I'm looking for an alternative way to do job scheduling. Since the machines are used by multiple users, it should be able to take the load into account. Optionally, I'm looking for a way to do this for all the machines in the pool, i.e. there is one central queue of jobs that need to be run, and a script that distributes them (over ssh) to the machines that are under a certain load. Any ideas?
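For the central-queue idea, here is a minimal sketch of what such a dispatcher could look like, assuming password-less ssh to the pool machines and a Linux-style /proc/loadavg on them; the host names, queue file and load threshold are made-up illustration values, not anything specific to your pool:

    #!/usr/bin/env python3
    # Minimal dispatcher sketch: read jobs from a local queue file and run each
    # one (over ssh) on the least loaded pool machine below a load cutoff.
    import subprocess
    import time

    HOSTS = ["node01", "node02", "node03"]   # hypothetical pool machines
    QUEUE_FILE = "jobs.txt"                  # one shell command per line
    MAX_LOAD = 2.0                           # 1-minute load average cutoff

    def load_of(host):
        """Return the 1-minute load average of a host, read over ssh."""
        out = subprocess.check_output(["ssh", host, "cat", "/proc/loadavg"],
                                      text=True)
        return float(out.split()[0])

    def pick_host():
        """Return the least loaded host below the cutoff, or None."""
        loads = []
        for host in HOSTS:
            try:
                loads.append((load_of(host), host))
            except subprocess.CalledProcessError:
                continue                     # skip unreachable nodes
        loads = [entry for entry in loads if entry[0] < MAX_LOAD]
        return min(loads)[1] if loads else None

    with open(QUEUE_FILE) as fh:
        jobs = [line.strip() for line in fh if line.strip()]

    for job in jobs:
        host = pick_host()
        while host is None:                  # every node is busy: wait and retry
            time.sleep(60)
            host = pick_host()
        # nohup + background so the job keeps running after the ssh call returns
        subprocess.run(["ssh", host, "nohup " + job + " >/dev/null 2>&1 &"])
        time.sleep(5)                        # give the load a moment to show up

This only looks at the load at dispatch time, which is exactly the limitation the answer below points out.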
Comments (1)
First: go talk to the system administrators of the compute pool. Enterprise-wide job schedulers have become a rather common component of infrastructure these days. Typically, though, these schedulers do not take system load into account.
If the above doesn't lead to a good solution, you should carefully consider what load your jobs will impose on the machine: your jobs could be stressing the CPU, consuming large amounts of memory, or generating lots of network or disk IO activity. Consequently, determining whether your job should start may depend on a lot of measurements, some of which you would not be able to take as an ordinary user (depending a bit on the kind of OS you are running and how tight security is). In any case, you would only be able to take into account the load at the job's start-up. Obviously, if every user does this, you're back at square one in no time...
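As a concrete illustration of "only at start-up": a job wrapper can simply poll the local load average and defer its own start, roughly like the sketch below (the cutoff and poll interval are arbitrary example values):

    # Defer the real work until the local 1-minute load average drops below
    # a cutoff. os.getloadavg is available on Unix-like systems.
    import os
    import time

    MAX_LOAD = 1.5          # e.g. leave roughly one core's worth of headroom
    POLL_SECONDS = 30

    while os.getloadavg()[0] >= MAX_LOAD:
        time.sleep(POLL_SECONDS)

    # ...start the actual job here once the node looks quiet enough...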
It might be a better idea to check with your system administrator whether they have some sort of resource controls in place (e.g. projects in Solaris) through which they can make sure your batches do not tear down the nodes in the compute pool. Next, write your batch jobs in such a way that they can cope with the OS declining requests for resources.
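One way to "cope with the OS declining requests" is to retry the resource-hungry step with a backoff instead of letting the job die on the first refusal; a rough sketch, where work is a placeholder for the job's real step:

    # Retry an allocation-heavy step with exponential backoff rather than
    # failing permanently when resources are temporarily unavailable.
    import time

    def run_with_backoff(work, attempts=5, first_wait=60):
        wait = first_wait
        for _ in range(attempts):
            try:
                return work()
            except (MemoryError, OSError):   # e.g. a resource cap was hit
                time.sleep(wait)
                wait *= 2                    # back off before the next attempt
        raise RuntimeError("giving up: resources stayed unavailable")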
EDIT: As for the distributed nature: queue up the jobs and have clients on all nodes point to the same queue, each consuming as much as it can within the limits of the resource controls...
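A sketch of that pull model, assuming the nodes share a filesystem (e.g. an NFS-mounted home directory); the directory names and load cutoff are illustrative:

    # Each node runs this consumer against the same queue directory. Renaming a
    # job file into the claim directory is atomic within one filesystem, so only
    # one node can pick up a given job.
    import os
    import subprocess
    import time

    QUEUE_DIR = os.path.expanduser("~/jobqueue/pending")   # one command per file
    CLAIM_DIR = os.path.expanduser("~/jobqueue/running")
    MAX_LOAD = 2.0

    while True:
        if os.getloadavg()[0] < MAX_LOAD:
            for name in sorted(os.listdir(QUEUE_DIR)):
                src = os.path.join(QUEUE_DIR, name)
                dst = os.path.join(CLAIM_DIR, os.uname().nodename + "-" + name)
                try:
                    os.rename(src, dst)      # atomic claim; if we lose the race, move on
                except OSError:
                    continue
                with open(dst) as fh:
                    subprocess.run(["sh", "-c", fh.read()])
                break                        # re-check the load before the next job
        time.sleep(60)

Whatever resource controls the administrators put in place would then cap what each consumer can actually grab.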