Hadoop HDFS - Cannot connect to port on master
I've set up a small Hadoop cluster for testing. Setup went fairly well with the NameNode (1 machine), SecondaryNameNode (1) and all DataNodes (3). The machines are named "master", "secondary", "data01", "data02" and "data03". DNS is properly set up for all of them, and passwordless SSH is configured from master/secondary to all machines and back.
I formatted the cluster with bin/hadoop namenode -format, and then started all services using bin/start-all.sh. All processes on all nodes were checked to be up and running with jps. My basic configuration files look something like this:
<!-- conf/core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<!--
on the master it's localhost
on the others it's the master's DNS
(ping works from everywhere)
-->
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<!-- I picked /hdfs for the root FS -->
<value>/hdfs/tmp</value>
</property>
</configuration>
<!-- conf/hdfs-site.xml -->
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hdfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
# conf/masters
secondary
# conf/slaves
data01
data02
data03
I'm just trying to get HDFS running properly now.
I've created a dir for testing with hadoop fs -mkdir testing, then tried to copy some files into it with hadoop fs -copyFromLocal /tmp/*.txt testing. This is when hadoop crashes, giving me more or less this:
WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hd/testing/wordcount1.txt could only be replicated to 0 nodes, instead of 1
at ... (such and such)
WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
at ...
WARN hdfs.DFSClient: Could not get block locations. Source file "/user/hd/testing/wordcount1.txt" - Aborting...
at ...
ERROR hdfs.DFSClient: Exception closing file /user/hd/testing/wordcount1.txt: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hd/testing/wordcount1.txt could only be replicated to 0 nodes, instead of 1
at ...
And so on. A similar issue occurs when I try to run hadoop fs -lsr . from a DataNode machine, only to get the following:
12/01/02 10:02:11 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 0 time(s).
12/01/02 10:02:12 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 1 time(s).
12/01/02 10:02:13 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 2 time(s).
...
I say it's similar because I suspect this is a port availability issue. Running telnet master 9000 reveals that the port is closed. I've read somewhere that this might be an IPv6 clash issue, and thus defined the following in conf/hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
But that didn't do the trick. Running netstat on the master reveals something like this:
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 localhost:9000 localhost:56387 ESTABLISHED
tcp 0 0 localhost:56386 localhost:9000 TIME_WAIT
tcp 0 0 localhost:56387 localhost:9000 ESTABLISHED
tcp 0 0 localhost:56384 localhost:9000 TIME_WAIT
tcp 0 0 localhost:56385 localhost:9000 TIME_WAIT
tcp 0 0 localhost:56383 localhost:9000 TIME_WAIT
At this point I'm pretty sure the problem is with the port (9000), but I'm not sure what I missed as far as configuration goes. Any ideas? Thanks.
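One way to read the netstat output above: every socket involving port 9000 has localhost on both ends, which suggests the NameNode may be bound to the loopback interface only. The following is a minimal sketch for checking that on the master, assuming a Linux machine with net-tools netstat installed. The helper names bind_scope and check_namenode_port are hypothetical, not part of Hadoop.

```shell
#!/bin/sh
# Hypothetical helper: classify a netstat "Local Address" value as
# loopback-only or reachable from other hosts.
bind_scope() {
    case "$1" in
        localhost:*|127.*:*|::1:*) echo "loopback" ;;
        *)                         echo "external" ;;
    esac
}

# List listening TCP sockets numerically and classify anything on port 9000.
# Assumes net-tools netstat, where column 4 is the local address.
check_namenode_port() {
    netstat -tln 2>/dev/null | awk '$4 ~ /:9000$/ {print $4}' |
    while read -r addr; do
        echo "$addr -> $(bind_scope "$addr")"
    done
}
```

Running check_namenode_port on the master prints one line per listener on port 9000. A "loopback" result would be consistent with telnet master 9000 failing from the other machines.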
Update
I found that hard-coding DNS names into /etc/hosts not only helps resolve this, but also speeds up the connections. The downside is that you have to do this on all the machines in the cluster, and again when you add new nodes. Or you can just set up a DNS server, which I didn't.
Here's a sample from one node in my cluster (nodes are named hadoop01, hadoop02, etc., with the master and secondary being 01 and 02). Note that most of it is generated by the OS:
# this is a sample for a machine with dns hadoop01
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
# --- Start list of nodes
192.168.10.101 hadoop01
192.168.10.102 hadoop02
192.168.10.103 hadoop03
192.168.10.104 hadoop04
192.168.10.105 hadoop05
192.168.10.106 hadoop06
192.168.10.107 hadoop07
192.168.10.108 hadoop08
192.168.10.109 hadoop09
192.168.10.110 hadoop10
# ... and so on
# --- End list of nodes
# Auto-generated hostname. Please do not remove this comment.
127.0.0.1 hadoop01 localhost localhost.localdomain
Hope this helps.
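In the sample hosts file, hadoop01 appears twice: once with its LAN address and once on the 127.0.0.1 line. Resolvers that read /etc/hosts typically use the first matching entry, so the order matters. The sketch below shows one way to see which address a hosts-format file maps a name to first. The function name first_hosts_entry is hypothetical, and the default path /etc/hosts is an assumption about the system.

```shell
#!/bin/sh
# Hypothetical helper: print the first address that a hosts-format file
# maps NAME to. Usage: first_hosts_entry NAME [FILE]
first_hosts_entry() {
    awk -v name="$1" '
        /^[[:space:]]*#/ { next }            # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break         # ignore trailing comments
                if ($i == name) { print $1; exit }
            }
        }
    ' "${2:-/etc/hosts}"
}
```

For example, first_hosts_entry hadoop01 on the machine above should print the LAN address rather than 127.0.0.1, because the node list comes before the auto-generated line.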
Comments (2)
When remote nodes need to connect to the NameNode, replace localhost in hdfs://localhost:9000 with the NameNode's IP address or hostname in the fs.default.name property on the NameNode.
There might also be errors in the log files. jps only confirms that the processes are running.
Correct your /etc/hosts file to include localhost, or correct your core-site file to specify the IP or hostname of the node that hosts the HDFS filesystem.
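Both answers come down to what host the fs.default.name URI names. As a rough sketch, the check below takes the value copied from core-site.xml and flags loopback names. The helper names hdfs_uri_host and check_fs_default_name are hypothetical, and the script does not parse the XML itself.

```shell
#!/bin/sh
# Hypothetical helper: extract the host part of an hdfs:// URI.
hdfs_uri_host() {
    rest="${1#hdfs://}"   # drop the scheme
    rest="${rest%%/*}"    # drop any path
    echo "${rest%%:*}"    # drop the port
}

# Hypothetical helper: warn when the URI names a loopback host.
check_fs_default_name() {
    host=$(hdfs_uri_host "$1")
    case "$host" in
        localhost|127.*|::1) echo "WARN: $host is a loopback name" ;;
        *)                   echo "OK: $host" ;;
    esac
}
```

With the configuration from the question, check_fs_default_name hdfs://localhost:9000 prints the warning, while hdfs://master:9000 passes.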