Please recommend an alternative to Microsoft HPC

Posted 2024-09-07 19:41:05 · 801 words · 9 views · 0 comments

We aim to implement a distributed system on a cluster that will perform resource-intensive, image-based computing with heavy storage I/O. It has the following characteristics:

  1. There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable.
  2. It is built around a job-task concept. A job may have from one to 100,000 tasks.
  3. A job, which is initiated by the user on the manager node, results in the creation of tasks on the compute nodes.
  4. Tasks create other tasks on the fly.
  5. Some tasks may run for minutes, while others may take many hours.
  6. The tasks run according to a dependency hierarchy, which may be updated on the fly.
  7. The job may be paused and resumed later.
  8. Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks.
  9. The tasks tell their progress and result back to the manager.
  10. The manager is aware of whether a task is alive or hung.
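
To make requirements 4, 6 and 8 concrete, here is a minimal sketch of a manager that only dispatches a task once its dependencies are done and some node has enough free resources. It is not taken from any of the products discussed here; the names `Task`, `Node`, `Scheduler` and the resource keys (`cores`, `mem_gb`, `disk_gb`, `gpu`) are all hypothetical, and resources are plain named counters so that custom units can be added freely.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    needs: dict                               # e.g. {"cores": 4, "mem_gb": 8, "disk_gb": 50}
    deps: set = field(default_factory=set)    # names of tasks that must finish first


@dataclass
class Node:
    name: str
    free: dict                                # remaining capacity per resource unit

    def fits(self, task):
        # A resource the node does not list counts as zero capacity.
        return all(self.free.get(r, 0) >= amount for r, amount in task.needs.items())


class Scheduler:
    def __init__(self, nodes):
        self.nodes = nodes
        self.pending = {}                     # name -> Task waiting to be placed
        self.running = {}                     # name -> (Task, Node)
        self.done = set()                     # names of finished tasks

    def submit(self, task):
        """Add a task at any time, for example on behalf of a running task."""
        self.pending[task.name] = task

    def dispatch(self):
        """Place each pending task whose dependencies are done and that fits a node."""
        placed = []
        for task in list(self.pending.values()):
            if not task.deps <= self.done:
                continue                      # still waiting on a dependency
            node = next((n for n in self.nodes if n.fits(task)), None)
            if node is None:
                continue                      # no node has enough free resources
            for r, amount in task.needs.items():
                node.free[r] -= amount
            del self.pending[task.name]
            self.running[task.name] = (task, node)
            placed.append((task.name, node.name))
        return placed

    def finish(self, name):
        """Release the resources of a finished task and record it as done."""
        task, node = self.running.pop(name)
        for r, amount in task.needs.items():
            node.free[r] += amount
        self.done.add(name)
```

Calling `submit` between `dispatch` rounds is how tasks created on the fly, or dependencies added later, would enter the picture. Pausing a job would amount to not calling `dispatch` for its tasks; persistence and the actual remote launch are left out of this sketch.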

We found Windows HPC Server 2008 (HPCS) R2 conceptually very close to what we need. However, it has a few critical downsides:

  1. Task creation gets exponentially slower as the number of tasks increases. Submitting more than a few thousand tasks takes unbearably long.
  2. A task cannot report its progress back to the manager; only a job can.
  3. There is no communication with a task while it runs, which makes it impossible to check whether the task is still running or needs restarting.
  4. HPCS only knows nodes, CPU cores and memory as resource units. We can't introduce resource units of our own (such as free disk space, custom hardware devices, etc.).
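
For downsides 2 and 3, what we are missing is roughly the following: each task periodically sends a heartbeat carrying its progress, and the manager treats a task that stays silent for too long as hung. This is only a sketch of the manager-side bookkeeping; `TaskMonitor`, its method names and the timeout value are hypothetical, and the transport that delivers the heartbeats (sockets, a message queue, etc.) is not shown.

```python
import time


class TaskMonitor:
    """Manager-side record of the last heartbeat and progress of each task."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the logic can be tested without sleeping
        self.last_seen = {}         # task id -> time of the last heartbeat
        self.progress = {}          # task id -> fraction complete, 0.0 to 1.0

    def heartbeat(self, task_id, progress):
        """Record a heartbeat; call this whenever a message from a task arrives."""
        self.last_seen[task_id] = self.clock()
        self.progress[task_id] = progress

    def hung_tasks(self):
        """Unfinished tasks that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            t for t, seen in self.last_seen.items()
            if self.progress[t] < 1.0 and now - seen > self.timeout_s
        )
```

The manager would poll `hung_tasks()` on a timer and decide whether to restart whatever it returns. The timeout has to be longer than the slowest interval at which a healthy task reports.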

Here's my question: does anybody know of, or have experience with, a distributed computing framework that could help us? We are using Windows.

Comments (8)

草莓酥 2024-09-14 19:41:05

I would take a look at the Condor high-throughput computing project. It supports Windows (and Linux, and OS X) clients and servers, handles complex dependencies between tasks using DAGMan, and can suspend (and even move) tasks. I have experience with Condor-based systems that scale to thousands of machines across university campuses.

温柔嚣张 2024-09-14 19:41:05

Platform LSF will do everything you need. It runs on Windows. It is commercial, and can be purchased with support.

  1. There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable. Yes
  2. It is built around job-task concept. A job may have one to 100,000 tasks. Yes
  3. A job, which is initiated by the user on the manager node, results in creation of tasks on the compute node. Yes
  4. Tasks create other tasks on the fly. Yes
  5. Some tasks may run for minutes, while others may take many hours. Yes
  6. The tasks run according to a dependency hierarchy, which may be updated on the fly. Yes
  7. The job may be paused and resumed later. Yes
  8. Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks. Yes
  9. The tasks tell their progress and result back to the manager. Yes
  10. The manager is aware if the task is alive or hung. Yes

狂之美人 2024-09-14 19:41:05

Have you looked at Beowulf? Lots of distributions to choose from, and lots of customization options. You ought to be able to find something to meet your needs...

聊慰 2024-09-14 19:41:05

I would recommend Beowulf, because a Beowulf cluster behaves more like a single machine than like many workstations.

小情绪 2024-09-14 19:41:05

Give GridGain a try. It should make adding nodes at runtime very easy, and you can monitor and manage the cluster through JMX interfaces.

寄离 2024-09-14 19:41:05

If you don't mind hosting your project in the cloud, you might want to have a look at Windows Azure / AppFabric. AFAIK it allows you to distribute your jobs via workflows, and you can dynamically add more worker machines to handle your jobs as the load increases.

香橙ぽ 2024-09-14 19:41:05

You can definitely solve this sort of problem using DataSynapse GridServer.

  1. There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable. Yes, a Broker can easily handle 2000 Engines.
  2. It is built around job-task concept. A job may have one to 100,000 tasks. Yes, I have queued in excess of 250,000 tasks without issue. Eventually you will run out of memory.
  3. A job, which is initiated by the user on the manager node, results in creation of tasks on the compute node. yes
  4. Tasks create other tasks on the fly. It can be done, although I would not recommend this sort of model
  5. Some tasks may run for minutes, while others may take many hours. yes
  6. The tasks run according to a dependency hierarchy, which may be updated on the fly. yes, but I would manage this outside of the grid computing infrastructure
  7. The job may be paused and resumed later. yes
  8. Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks. yes
  9. The tasks tell their progress and result back to the manager. yes
  10. The manager is aware if the task is alive or hung. yes

掌心的温暖 2024-09-14 19:41:05

Have you examined Sun Grid Engine? It's been a long time since I used it, and I never used it to its full capabilities, but this is my understanding.

  1. There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable. yes
  2. It is built around job-task concept. A job may have one to 100,000 tasks. not sure
  3. A job, which is initiated by the user on the manager node, results in creation of tasks on the compute node. yes
  4. Tasks create other tasks on the fly. I think so?
  5. Some tasks may run for minutes, while others may take many hours. yes
  6. The tasks run according to a dependency hierarchy, which may be updated on the fly. not sure
  7. The job may be paused and resumed later. not sure
  8. Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks. pretty sure
  9. The tasks tell their progress and result back to the manager. pretty sure
  10. The manager is aware if the task is alive or hung. yes
