Hadoop HDFS - Cannot connect to port on master


I've set up a small Hadoop cluster for testing. Setup went fairly well with the NameNode (1 machine), SecondaryNameNode (1) and all DataNodes (3). The machines are named "master", "secondary" and "data01", "data02" and "data03". DNS is properly set up for all machines, and passwordless SSH was configured from the master/secondary to all machines and back.

I formatted the cluster with bin/hadoop namenode -format, and then started all services using bin/start-all.sh. All processes on all nodes were checked to be up and running with jps. My basic configuration files look something like this:

<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- 
      on the master it's localhost
      on the others it's the master's DNS
      (ping works from everywhere)
    -->
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- I picked /hdfs for the root FS -->
    <value>/hdfs/tmp</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/hdfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

# conf/masters
secondary

# conf/slaves
data01
data02
data03

I'm just trying to get HDFS running properly now.

I've created a dir for testing hadoop fs -mkdir testing, then tried to copy some files into it with hadoop fs -copyFromLocal /tmp/*.txt testing. This is when hadoop crashes, giving me more or less this:

WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hd/testing/wordcount1.txt could only be replicated to 0 nodes, instead of 1
  at ... (such and such)

WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
  at ...

WARN hdfs.DFSClient: Could not get block locations. Source file "/user/hd/testing/wordcount1.txt" - Aborting...
  at ...

ERROR hdfs.DFSClient: Exception closing file /user/hd/testing/wordcount1.txt: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hd/testing/wordcount1.txt could only be replicated to 0 nodes, instead of 1
  at ...

And so on. A similar issue occurs when I try to run hadoop fs -lsr . from a DataNode machine, only to get the following:

12/01/02 10:02:11 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 0 time(s).
12/01/02 10:02:12 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 1 time(s).
12/01/02 10:02:13 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 2 time(s).
...

I'm saying it's similar, because I suspect this is a port availability issue. Running telnet master 9000 reveals that the port is closed. I've read somewhere that this might be an IPv6 clash issue, and thus defined the following in conf/hadoop-env.sh:

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

But that didn't do the trick. Running netstat on the master reveals something like this:

Proto Recv-Q Send-Q  Local Address       Foreign Address      State
tcp        0      0  localhost:9000      localhost:56387      ESTABLISHED
tcp        0      0  localhost:56386     localhost:9000       TIME_WAIT
tcp        0      0  localhost:56387     localhost:9000       ESTABLISHED
tcp        0      0  localhost:56384     localhost:9000       TIME_WAIT
tcp        0      0  localhost:56385     localhost:9000       TIME_WAIT
tcp        0      0  localhost:56383     localhost:9000       TIME_WAIT

At this point I'm pretty sure the problem is with the port (9000), but I'm not sure what I missed as far as configuration goes. Any ideas? Thanks.
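
A quick way to confirm what address the NameNode is actually bound to (a sketch; assumes a Linux master with net-tools installed):

# run on the master; shows the listening socket for port 9000 and its owning process
netstat -tlnp | grep 9000
# a local address of 127.0.0.1:9000 (rather than the master's LAN IP or 0.0.0.0:9000)
# means the NameNode is only reachable from the master itself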

Update

I found that hard-coding the DNS names into /etc/hosts not only helps resolve this, but also speeds up the connections. The downside is that you have to do this on every machine in the cluster, and again whenever you add new nodes. Or you could just set up a DNS server, which I didn't.

Here's a sample from one node in my cluster (nodes are named hadoop01, hadoop02, etc., with the master and secondary being 01 and 02). Note that most of it was generated by the OS:

# this is a sample for a machine with dns hadoop01
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

# --- Start list of nodes
192.168.10.101 hadoop01
192.168.10.102 hadoop02
192.168.10.103 hadoop03
192.168.10.104 hadoop04
192.168.10.105 hadoop05
192.168.10.106 hadoop06
192.168.10.107 hadoop07
192.168.10.108 hadoop08
192.168.10.109 hadoop09
192.168.10.110 hadoop10
# ... and so on

# --- End list of nodes

# Auto-generated hostname. Please do not remove this comment.
127.0.0.1 hadoop01 localhost localhost.localdomain
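
After editing the hosts file, a quick sanity check from one of the DataNodes (a sketch; assumes getent and telnet are available, hostnames as in the sample above):

# the master's name should resolve to its LAN address, not 127.0.0.1
getent hosts hadoop01
# and the NameNode port should now be reachable
telnet hadoop01 9000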

Hope this helps.

Comments (2)

像极了他 2024-12-31 22:24:37

Replace localhost in hdfs://localhost:9000 with the NameNode's IP address or hostname in the fs.default.name property when remote nodes need to connect to the NameNode.
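
For example, with the hostnames from the question, conf/core-site.xml could look like this on every node (a sketch, not verified against this cluster):

<!-- conf/core-site.xml, identical on the master and all DataNodes -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hdfs/tmp</value>
  </property>
</configuration>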

Regarding "All processes on all nodes were checked to be up and running with jps":

There might still be errors in the log files; jps only confirms that the processes are running.
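
As a rough sketch of where to look (the log directory and file-name pattern are assumptions based on a default Hadoop install):

# on the master, check the NameNode log for bind or connection errors
tail -n 100 logs/hadoop-*-namenode-*.log
# on a DataNode, check the DataNode log for failed connections to master:9000
tail -n 100 logs/hadoop-*-datanode-*.log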

耶耶耶 2024-12-31 22:24:37

Correct your /etc/hosts file to include localhost, or correct your core-site file to specify the IP or hostname of the node that hosts the HDFS filesystem.
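
For example, an /etc/hosts along these lines (a sketch using the addresses from the question's update; adjust to your own network):

# on every node: keep 127.0.0.1 for localhost only,
# and map each node's name to its LAN address
127.0.0.1       localhost localhost.localdomain
192.168.10.101  hadoop01
192.168.10.102  hadoop02
192.168.10.103  hadoop03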
