Please recommend an alternative to Microsoft HPC
We aim to implement a distributed system on a cluster that will perform resource-intensive, image-based computation with heavy storage I/O. It has the following characteristics:
- There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable.
- It is built around a job-task concept. A job may have from one to 100,000 tasks.
- A job, which is initiated by the user on the manager node, results in the creation of tasks on the compute nodes.
- Tasks create other tasks on the fly.
- Some tasks may run for minutes, while others may take many hours.
- The tasks run according to a dependency hierarchy, which may be updated on the fly.
- The job may be paused and resumed later.
- Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks.
- The tasks report their progress and results back to the manager.
- The manager knows whether a task is alive or hung.
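To make the scheduling requirements above concrete, here is a minimal Python sketch of resource-aware placement with dependencies. It is only an illustration, not any product's API: the `Task`, `Node`, and `schedule` names and the resource keys (`cores`, `mem_gb`, `disk_gb`) are all hypothetical. A task is placed only when every task it depends on has finished and some node has enough of each requested resource.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    needs: dict                               # hypothetical units, e.g. {"cores": 2, "disk_gb": 10}
    deps: set = field(default_factory=set)    # names of tasks that must finish first


@dataclass
class Node:
    name: str
    free: dict                                # remaining capacity per resource unit

    def fits(self, task):
        # A resource the node does not list counts as zero capacity.
        return all(self.free.get(r, 0) >= amount for r, amount in task.needs.items())

    def reserve(self, task):
        for r, amount in task.needs.items():
            self.free[r] -= amount


def schedule(tasks, nodes, done):
    """Place every ready task on the first node that fits it.

    Returns {task name: node name}. Tasks whose dependencies are not
    all in `done`, or that fit on no node, are left for a later pass.
    """
    placed = {}
    for task in tasks:
        if task.name in done or not task.deps <= done:
            continue
        for node in nodes:
            if node.fits(task):
                node.reserve(task)
                placed[task.name] = node.name
                break
    return placed
```

Because `needs` and `free` are plain dictionaries, a custom resource unit (a license count, a special device) is just another key. Tasks created on the fly can be appended to `tasks` between passes.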
We found Windows HPC Server 2008 R2 (HPCS) conceptually very close to what we need. However, it has a few critical downsides:
- Task creation gets exponentially slower as the number of tasks increases. Submitting more than a few thousand tasks takes unbearably long.
- A task cannot report its progress back to the manager; only a job can.
- There is no communication with a task while it runs, which makes it impossible to check whether the task is still running or needs restarting.
- HPCS only knows nodes, CPU cores and memory as resource units. We can't introduce resource units of our own (such as free disk space, custom hardware devices, etc.).
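The per-task progress and liveness reporting that HPCS lacks can be sketched as a heartbeat table kept by the manager. This is a hypothetical Python illustration, not part of HPCS or of any framework named on this page: `TaskMonitor`, `heartbeat`, and `hung_tasks` are made-up names, and the sketch assumes each task periodically sends its ID and fraction complete to the manager.

```python
import time


class TaskMonitor:
    """Manager-side record of the last heartbeat and progress of each task."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}   # task id -> time of last heartbeat
        self.progress = {}    # task id -> fraction complete, 0.0 to 1.0

    def heartbeat(self, task_id, progress, now=None):
        # `now` can be passed explicitly to make the logic testable.
        self.last_seen[task_id] = time.monotonic() if now is None else now
        self.progress[task_id] = progress

    def hung_tasks(self, now=None):
        """Return unfinished tasks that have been silent longer than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            task_id
            for task_id, seen in self.last_seen.items()
            if self.progress[task_id] < 1.0 and now - seen > self.timeout_s
        )
```

A task that stops sending heartbeats before reaching 1.0 shows up in `hung_tasks()` and becomes a candidate for restarting. The timeout would have to be chosen per workload, since some tasks here run for hours.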
Here's my question: does anybody know of, or have experience with, a distributed computing framework that could help us? We are using Windows.
Comments (8)
I would take a look at the Condor high-throughput computing project. It supports Windows (as well as Linux and OS X) clients and servers, handles complex dependencies between tasks using DAGMan, and can suspend (and even move) tasks. I have experience with Condor-based systems that scale to thousands of machines across university campuses.
Platform LSF will do everything you need. It runs on Windows. It is commercial, and can be purchased with support.
Yes 1. There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable.
Yes 2. It is built around job-task concept. A job may have one to 100,000 tasks.
Yes 3. A job, which is initiated by the user on the manager node, results in creation of tasks on the compute node.
Yes 4. Tasks create other tasks on the fly.
Yes 5. Some tasks may run for minutes, while others may take many hours.
Yes 6. The tasks run according to a dependency hierarchy, which may be updated on the fly.
Yes 7. The job may be paused and resumed later.
Yes 8. Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks.
Yes 9. The tasks tell their progress and result back to the manager.
Yes 10. The manager is aware if the task is alive or hung.
Have you looked at Beowulf? Lots of distributions to choose from, and lots of customization options. You ought to be able to find something to meet your needs...
I would recommend Beowulf because it behaves more like a single machine than like many separate workstations.
Give GridGain a try. It should make adding nodes at runtime very easy, and you can monitor and manage the cluster through JMX interfaces.
If you don't mind hosting your project in the cloud, you might want to have a look at Windows Azure / AppFabric. AFAIK it lets you distribute your jobs via workflows, and you can dynamically add more worker machines to handle your jobs as the load increases.
You can definitely solve this sort of problem using DataSynapse GridServer.
10. The manager is aware if the task is alive or hung: yes
Have you examined Sun Grid Engine? It's been a long time since I used it, and I never used it to its full capabilities, but this is my understanding.
10. The manager is aware if the task is alive or hung: yes