GridGain failover of master (sender) node
I am working on a batch processing problem. The solution needs to handle failing hardware.
There is a master node (which initiates task execution) and worker nodes, which execute the jobs. I know how failover of worker nodes works, but I could not find any information about failover of the master node. Whenever the master node that started a task fails, the whole task is canceled.
Is there any way to finish processing the task in that case?
Could you suggest the best way of implementing failover of the master node?
Kind Regards,
Kuba
Comments (1)
Whenever your master node dies, there is basically no one left to perform the "reduce" step of your MapReduce task.
There are several ways you can try mitigating this problem:
1. Save intermediate checkpoints using GridCheckpointSpi (GridTaskSession.saveCheckpoint(..) API). When your task restarts after a node crash, you can check whether a checkpoint was saved and start from it.
2. Do the same as in (1), but use the data grid instead (GridCache API).
3. If you don't care about "reduce", have your jobs ignore the "cancel" call and just have them save their results in the data grid when they are done.
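To make the idea concrete, here is a minimal sketch of the resume-from-saved-results pattern in plain Java. It deliberately does not use the GridGain API: the `Map` named `store` is only a stand-in for whatever survives a master crash (a checkpoint SPI or a data grid cache), and `CheckpointedTask`, `runJob`, `execute` and `failAfter` are hypothetical names made up for this illustration. Each job writes its own result to the store as soon as it finishes, and a restarted task only runs the jobs whose results are missing before doing the "reduce".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class CheckpointedTask {
    // Counts how many jobs actually ran, to show that a restart skips finished work.
    public static final AtomicInteger executedJobs = new AtomicInteger();

    // A "job": here it just squares its input. The result goes straight into the
    // store, so it does not depend on the master being alive to collect it.
    public static void runJob(int input, Map<Integer, Integer> store) {
        executedJobs.incrementAndGet();
        store.put(input, input * input);
    }

    // "Map" with resume, then "reduce". Inputs that already have a stored result
    // are skipped. failAfter simulates a master crash after that many jobs have
    // run in this attempt (a negative value means no crash).
    public static int execute(List<Integer> inputs, Map<Integer, Integer> store, int failAfter) {
        int ran = 0;
        for (int input : inputs) {
            if (store.containsKey(input)) {
                continue; // result saved by an earlier attempt
            }
            if (ran == failAfter) {
                throw new IllegalStateException("simulated master crash");
            }
            runJob(input, store);
            ran++;
        }
        // "Reduce": sum all stored results.
        int sum = 0;
        for (int input : inputs) {
            sum += store.get(input);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Assumption: in a real grid this would be shared, crash-surviving storage.
        Map<Integer, Integer> store = new ConcurrentHashMap<>();
        List<Integer> inputs = Arrays.asList(1, 2, 3, 4);

        try {
            execute(inputs, store, 2); // first attempt dies after two jobs
        } catch (IllegalStateException e) {
            System.out.println("first attempt: " + e.getMessage());
        }

        int result = execute(inputs, store, -1); // restart picks up where it left off
        System.out.println("result=" + result + ", jobs executed=" + executedJobs.get());
    }
}
```

In this sketch the second attempt runs only the two unfinished jobs, so four jobs run in total rather than six. With GridGain, the store would presumably be replaced by the checkpoint or cache calls mentioned above, and something outside the failed master would still have to trigger the restart.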
--Best