Hadoop safemode recovery - taking a lot of time

Posted on 2024-09-01 06:49:21

We are running our cluster on Amazon EC2, and we use the Cloudera scripts to set up Hadoop. On the master node, we start the following services.

  $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start namenode'
  $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start secondarynamenode'
  $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start jobtracker'

  $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop dfsadmin -safemode wait'

On the slave machines, we run the following services.

  $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start datanode'
  $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start tasktracker'

The main problem we are facing is that HDFS safemode recovery takes more than an hour, which delays the completion of our jobs.

Below are the main log messages.

1. domU-12-31-39-0A-34-61.compute-1.internal 10/05/05 20:44:19 INFO ipc.Client: Retrying connect to server: ec2-184-73-64-64.compute-1.amazonaws.com/10.192.11.240:8020. Already tried 21 time(s).
2. The reported blocks 283634 needs additional 322258 blocks to reach the threshold 0.9990 of total blocks 606499. Safe mode will be turned off automatically.

The first message appears in the task tracker logs because the job tracker has not started. The job tracker did not start because HDFS was still in safemode recovery.

The second message is logged during the recovery process.
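
The numbers in the second message can be checked by hand, which shows how far the name node is from leaving safe mode. The sketch below uses only the values quoted in that message; truncating the product to an integer is an assumption, chosen because it reproduces the "additional 322258 blocks" figure from the log.

```shell
# Values copied from the second log message above.
reported=283634
total=606499
threshold=0.9990

# Blocks that must be reported before safe mode ends:
# total * threshold, truncated to an integer (assumed rounding).
needed=$(awk -v t="$total" -v p="$threshold" 'BEGIN { printf "%d", t * p }')

# Blocks still missing at the time the message was logged.
missing=$((needed - reported))

echo "needed=$needed missing=$missing"
```

At the time of that message, fewer than half of the blocks had been reported by the data nodes.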

Is there something I am doing wrong?
How long does a normal HDFS safemode recovery take?
Would it speed things up to hold off starting the task trackers until the job tracker has started?
Are there any known Hadoop problems on Amazon clusters?

Thanks for your help.

Comments (1)

反目相谮 2024-09-08 06:49:21

The time spent in safe mode is usually proportional to the size of the cluster. That said, normal time is on the order of minutes at most, not hours. There are a few things to check.

  1. Confirm that all data nodes are starting up correctly. It's normal for a data node with a large number of blocks to take a few seconds or minutes to report them in. Check the data node logs to see what's happening during startup.
  2. Ensure you have enough name node threads (dfs.namenode.handler.count in hdfs-site.xml) to handle the number of data nodes that want to check in. The default is 10, which should be fine for clusters of up to 20 nodes or so. Beyond that, it may make sense to increase it. You may see retries in the data node logs that would indicate this. This is what the retry messages seem to indicate to me (e.g. retry 21 times).
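
As a rough sketch of both checks, the helpers below count retry lines in a daemon log and print a property fragment for hdfs-site.xml. The function names are hypothetical, the search string is taken from the log message in the question, and any value passed to the second helper is only illustrative, not a recommendation.

```shell
# Count connection-retry lines in a Hadoop daemon log.
# The search string comes from the log message quoted in the question.
count_retries() {
    grep -c 'Retrying connect to server' "$1"
}

# Print an hdfs-site.xml property fragment for the name node handler count.
# The value passed in is illustrative; the right number depends on the cluster.
handler_count_property() {
    cat <<EOF
<property>
  <name>dfs.namenode.handler.count</name>
  <value>$1</value>
</property>
EOF
}
```

For example, `count_retries /path/to/datanode.log` (a placeholder path) prints the number of retry lines, and `handler_count_property 40` prints a fragment that could be pasted inside the `<configuration>` element of hdfs-site.xml.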

Hope this helps.
