Hadoop Namenode: single point of failure
The Namenode in the Hadoop architecture is a single point of failure.
How do people who have large Hadoop clusters cope with this problem?
Is there an industry-accepted solution that has worked well, wherein a secondary Namenode takes over in case the primary one fails?
Yahoo has certain recommendations for configuration settings at different cluster sizes to take NameNode failure into account. For example:
Facebook uses a tweaked version of Hadoop for its data warehouses; it has some optimizations that focus on NameNode reliability. In addition to the patches available on GitHub, Facebook appears to use AvatarNode specifically for quickly switching between primary and secondary NameNodes. Dhruba Borthakur's blog contains several other entries offering further insight into the NameNode as a single point of failure.
Edit: Further info about Facebook's improvements to the NameNode.
High Availability of the Namenode was introduced with the Hadoop 2.x release.
It can be achieved in two modes: with NFS or with the Quorum Journal Manager (QJM).
High availability with the Quorum Journal Manager (QJM) is the preferred option.
Have a look at the SE questions below, which explain the complete failover process.
Secondary NameNode usage and High availability in Hadoop 2.x
How does Hadoop Namenode failover process works?
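The QJM mode mentioned above is driven by a handful of properties in `hdfs-site.xml`. As a rough sketch only (the nameservice ID `mycluster`, the NameNode IDs `nn1`/`nn2`, and all hostnames are illustrative placeholders, not values from this answer):

```xml
<!-- Sketch of a QJM-based HDFS HA configuration (hdfs-site.xml).
     All IDs and hostnames below are placeholder assumptions. -->
<configuration>
  <!-- Logical name for this HA pair of NameNodes -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2.example.com:8020</value>
  </property>
  <!-- Edit log shared through a quorum of JournalNodes -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
  </property>
  <!-- Automatic failover via ZooKeeper failover controllers (ZKFC) -->
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>
```

With this in place, the standby NameNode replays edits from the JournalNode quorum so it can take over without losing namespace state.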
大型 Hadoop 集群拥有数千个数据节点和一个名称节点。故障概率随机器数量线性增加(其他条件相同)。因此,如果 Hadoop 无法应对数据节点故障,它就无法扩展。由于仍然只有一个名称节点,因此存在单点故障 (SPOF),但故障概率仍然很低。
令人悲伤的是,Bkkbrad 关于 Facebook 向名称节点添加故障转移功能的回答是正确的。
Large Hadoop clusters have thousands of data nodes and one name node. The probability of failure goes up linearly with machine count (all else being equal). So if Hadoop didn't cope with data node failures it wouldn't scale. Since there's still only one name node the Single Point of Failure (SPOF) is there, but the probability of failure is still low.
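The scaling argument can be made concrete with a short calculation; the per-node failure probability used below is an illustrative assumption, not a measured figure:

```python
# Probability that at least one of n independent nodes fails in a given
# period, given a per-node failure probability p for that period.
def cluster_failure_prob(n: int, p: float) -> float:
    return 1 - (1 - p) ** n

p = 0.001  # assumed per-node failure probability (illustrative only)

# With thousands of data nodes, some data-node failure is near certain,
# which is why HDFS must tolerate data-node loss to scale at all...
many_datanodes = cluster_failure_prob(3000, p)

# ...while the lone name node keeps the low single-machine probability.
one_namenode = cluster_failure_prob(1, p)

print(f"P(some data node fails) = {many_datanodes:.3f}")  # close to 1
print(f"P(the name node fails)  = {one_namenode:.3f}")    # stays ~ p
```

For small `p`, `1 - (1 - p)**n` is approximately `n * p`, which is the linear growth with machine count described above.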
That said, Bkkbrad's answer about Facebook adding failover capability to the name node is right on.