Hadoop 减少任务挂起

发布于 2024-11-17 05:59:01 字数 2732 浏览 4 评论 0原文

我设置了一个具有4个节点的hadoop集群，当运行map-reduce任务时，map任务很快完成，而reduce任务挂在27%。我检查了日志，是reduce任务无法从map节点获取map输出。

master 的作业跟踪器日志显示如下消息：

---------------------------------
2011-06-27 19:55:14,748 INFO org.apache.hadoop.mapred.JobTracker: Adding task (REDUCE)
'attempt_201106271953_0001_r_000000_0' to tip task_201106271953_0001_r_000000, for 
tracker 'tracker_web30.bbn.com.cn:localhost/127.0.0.1:56476'

master 的名称节点日志显示如下消息：

2011-06-27 14:00:52,898 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 
54310, call register(DatanodeRegistration(202.106.199.39:50010, storageID=DS-1989397900-
202.106.199.39-50010-1308723051262, infoPort=50075, ipcPort=50020)) from 
192.168.225.19:16129: error: java.io.IOException: verifyNodeRegistration: unknown 
datanode 202.106.199.3     9:50010

但是，“web30.bbn.com.cn”或 202.106.199.39、202.106.199.3 都不是从节点。我认为出现这样的ip/domains是因为hadoop无法解析一个节点（首先在内网DNS服务器中），然后它转到更高级别的DNS服务器，后来到顶部，仍然失败，然后是“垃圾”ip/domains被退回。

但我检查了我的配置，它是这样的：

---------------------------------
/etc/hosts:
127.0.0.1       localhost.localdomain localhost
::1     localhost6.localdomain6 localhost6
192.168.225.16 master
192.168.225.66 slave1
192.168.225.20 slave5
192.168.225.17 slave17

conf/core-site.xml：

---------------------------------
<?xml version="2.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/root/hadoop_tmp/hadoop_${user.name}</value>
    </property> 
    <property>
            <name>fs.default.name</name>
            <value>hdfs://master:54310</value>
     </property> 
    <property>
            <name>io.sort.mb</name>
            <value>1024</value>
        </property>
</configuration>

hdfs-site.xml：

---------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>

masters：

---------------------------------
master

slaves：

---------------------------------
master
slave1
slave5
slave17

此外，所有防火墙（iptables）都已关闭，并且每2个节点之间的ssh都可以。所以我不知道错误到底来自哪里。请帮忙。多谢。

原文

I set up a hadoop cluster with 4 nodes, When running a map-reduce task, the map task finishes quickly, while the reduce task hangs at 27% percent. I checked the log, it's that the reduce task fails to fetch map output from map nodes.

The job tracker log of master shows messages like this:

---------------------------------
2011-06-27 19:55:14,748 INFO org.apache.hadoop.mapred.JobTracker: Adding task (REDUCE)
'attempt_201106271953_0001_r_000000_0' to tip task_201106271953_0001_r_000000, for 
tracker 'tracker_web30.bbn.com.cn:localhost/127.0.0.1:56476'

And the name node log of master shows messages like this:

2011-06-27 14:00:52,898 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 
54310, call register(DatanodeRegistration(202.106.199.39:50010, storageID=DS-1989397900-
202.106.199.39-50010-1308723051262, infoPort=50075, ipcPort=50020)) from 
192.168.225.19:16129: error: java.io.IOException: verifyNodeRegistration: unknown 
datanode 202.106.199.3     9:50010

However, neither the "web30.bbn.com.cn" or 202.106.199.39, 202.106.199.3 is the slave node. I think such ip/domains appear because hadoop fails to resolve a node(first in the Intranet DNS server), then it goes to a higher-level DNS server, later to the top, still fails, then the "junk" ip/domains are returned.

But I checked my config, it goes like this:

---------------------------------
/etc/hosts:
127.0.0.1       localhost.localdomain localhost
::1     localhost6.localdomain6 localhost6
192.168.225.16 master
192.168.225.66 slave1
192.168.225.20 slave5
192.168.225.17 slave17

conf/core-site.xml:

---------------------------------
<?xml version="2.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/root/hadoop_tmp/hadoop_${user.name}</value>
    </property> 
    <property>
            <name>fs.default.name</name>
            <value>hdfs://master:54310</value>
     </property> 
    <property>
            <name>io.sort.mb</name>
            <value>1024</value>
        </property>
</configuration>

hdfs-site.xml:

---------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>

masters:

---------------------------------
master

slaves:

---------------------------------
master
slave1
slave5
slave17

Also, all firewalls(iptables) are turned off, and ssh between each 2 nodes is ok.
so I don't know where exact the error comes from. Please help. Thanks a lot.

分享到QQ

分享到微博