Even data distribution on Hadoop/Hive

Posted on 2024-09-08 07:50:52


I am trying a small Hadoop setup (for experimentation) with just 2 machines. I am using Hive to load about 13 GB of data, a table of around 39 million rows, with a replication factor of 1.

My problem is that Hadoop always stores all of this data on a single datanode. Only when I change the replication factor to 2 using setrep does Hadoop copy the data to the other node. I also tried the balancer ($HADOOP_HOME/bin/start-balancer.sh -threshold 0). The balancer recognizes that it needs to move around 5 GB to balance, but then reports "No block can be moved. Exiting..." and quits:

2010-07-05 08:27:54,974 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Using a threshold of 0.0
2010-07-05 08:27:56,995 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.252.130.177:1036
2010-07-05 08:27:56,995 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.220.222.64:1036
2010-07-05 08:27:56,996 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 over utilized nodes: 10.220.222.64:1036
2010-07-05 08:27:56,996 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 under utilized nodes:  10.252.130.177:1036
2010-07-05 08:27:56,997 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Need to move 5.42 GB bytes to make the cluster balanced.

Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
No block can be moved. Exiting...
Balancing took 2.222 seconds

Can anybody suggest how to achieve an even distribution of data on Hadoop, without replication?
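One way to investigate this is to check which datanodes the namenode actually knows about and where the blocks landed. The commands below are a sketch using the classic Hadoop 0.20-era CLI that matches the log output above; /user/hive/warehouse is Hive's default warehouse location and is an assumption here, so adjust it to your table's actual path:

```shell
# List the live datanodes registered with the namenode, with per-node usage.
hadoop dfsadmin -report

# Show every file and block under the Hive warehouse and which
# datanode(s) hold each block (path is an assumption; adjust as needed).
hadoop fsck /user/hive/warehouse -files -blocks -locations

# Re-run the balancer with a slightly looser threshold
# (threshold is a percentage of cluster capacity).
hadoop balancer -threshold 1
```

One common cause of this exact symptom: when the HDFS client writing the data runs on a machine that is itself a datanode, HDFS places the first (and with replication factor 1, only) replica of every block on that local datanode. Loading data via Hive from one of the two nodes would then concentrate everything there.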


Comments (1)

爱要勇敢去追 2024-09-15 07:50:52


Are you using both of your machines as datanodes? It's unlikely, but you can confirm this for me.

Typically in a 2-machine cluster, I'd expect one machine to be the namenode and the other to be the datanode. So when you set the replication factor to 1, the data gets written to the only datanode available. If you change it to 2, it may look for another datanode in the cluster to copy the data to, fail to find one, and hence exit.
