Even data distribution on hadoop/hive

Posted on 2024-09-08 07:50:52

I am trying out a small Hadoop setup (for experimentation) with just 2 machines. Using Hive, I am loading about 13 GB of data, a table of around 39 million rows, with a replication factor of 1.

My problem is that Hadoop always stores all of this data on a single datanode. Only when I change the replication factor (dfs.replication) to 2 using setrep does Hadoop copy the data to the other node. I also tried the balancer ($HADOOP_HOME/bin/start-balancer.sh -threshold 0). The balancer recognizes that it needs to move around 5 GB to balance the cluster, but it reports "No block can be moved. Exiting..." and exits:

2010-07-05 08:27:54,974 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Using a threshold of 0.0
2010-07-05 08:27:56,995 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.252.130.177:1036
2010-07-05 08:27:56,995 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.220.222.64:1036
2010-07-05 08:27:56,996 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 over utilized nodes: 10.220.222.64:1036
2010-07-05 08:27:56,996 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 under utilized nodes:  10.252.130.177:1036
2010-07-05 08:27:56,997 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Need to move 5.42 GB bytes to make the cluster balanced.

Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
No block can be moved. Exiting...
Balancing took 2.222 seconds
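
For anyone comparing several balancer runs, the output above can be condensed with a small script. This is only a sketch: the function name summarize_balancer_log and the dictionary keys are made up for illustration, and the patterns are matched against the log lines shown in this post, not against any documented log format.

```python
import re

# Relevant lines copied from the balancer output in this post.
SAMPLE_LOG = """\
2010-07-05 08:27:56,996 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 over utilized nodes: 10.220.222.64:1036
2010-07-05 08:27:56,996 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 under utilized nodes:  10.252.130.177:1036
2010-07-05 08:27:56,997 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Need to move 5.42 GB bytes to make the cluster balanced.
No block can be moved. Exiting...
"""


def summarize_balancer_log(text):
    """Collect over/under utilized nodes, the amount to move, and whether the run stalled."""
    summary = {"over": [], "under": [], "to_move": None, "stalled": False}
    for line in text.splitlines():
        # e.g. "1 over utilized nodes: 10.220.222.64:1036"
        nodes = re.search(r"\d+ (over|under) utilized nodes:\s*(.+)", line)
        if nodes:
            summary[nodes.group(1)] = nodes.group(2).split()
            continue
        # e.g. "Need to move 5.42 GB bytes to make the cluster balanced."
        amount = re.search(r"Need to move ([\d.]+ \w+)", line)
        if amount:
            summary["to_move"] = amount.group(1)
        if "No block can be moved" in line:
            summary["stalled"] = True
    return summary


if __name__ == "__main__":
    print(summarize_balancer_log(SAMPLE_LOG))
```

The "stalled" flag simply records that the balancer gave up without moving anything, which is the symptom described here.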

Can anybody suggest how to achieve an even distribution of data on Hadoop without replication?
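
For reference, the steps described above can be written down as command lines. This is a rough sketch, not a fix: the table path /user/hive/warehouse/my_table and the /opt/hadoop fallback are hypothetical placeholders, and the fsck call is an extra step (not something tried in this post) that reports which datanodes hold each block. The script only prints the commands and does not run them.

```python
import os
import shlex

# Hypothetical HDFS location of the Hive table; replace with the real path.
TABLE_PATH = "/user/hive/warehouse/my_table"


def diagnostic_commands(table_path, hadoop_home, replication=2, threshold=0):
    """Return the CLI invocations as argument lists, without executing them."""
    hadoop = f"{hadoop_home}/bin/hadoop"
    return [
        # Show which datanodes hold each block of the table's files.
        [hadoop, "fsck", table_path, "-files", "-blocks", "-locations"],
        # Change the replication factor of the existing files, recursively.
        [hadoop, "fs", "-setrep", "-R", str(replication), table_path],
        # Start the balancer with the given utilization threshold (in percent).
        [f"{hadoop_home}/bin/start-balancer.sh", "-threshold", str(threshold)],
    ]


if __name__ == "__main__":
    # /opt/hadoop is only an assumed fallback when HADOOP_HOME is not set.
    home = os.environ.get("HADOOP_HOME", "/opt/hadoop")
    for argv in diagnostic_commands(TABLE_PATH, home):
        print(shlex.join(argv))
```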
