How can I make sure data is evenly distributed across the Hadoop nodes?
If I copy data from my local system to HDFS, can I be sure that it is distributed evenly across the nodes?
PS: HDFS guarantees that each block will be stored on 3 different nodes. But does this mean that all blocks of my file will be stored on the same 3 nodes? Or will HDFS select nodes at random for each new block?
4 Answers
If your replication is set to 3, each block will be put on 3 separate nodes. The number of nodes a block is placed on is controlled by your replication factor. If you want greater distribution, you can increase the replication number by editing $HADOOP_HOME/conf/hadoop-site.xml and changing the dfs.replication value.
I believe new blocks are placed almost randomly. There is some consideration for distribution across different racks (when Hadoop is made aware of racks). There is an example (can't find the link) that if you have replication at 3 and 2 racks, 2 replicas will be in one rack and the third will be placed in the other rack. I would guess that there is no preference shown for which node within a rack gets the blocks.
I haven't seen anything indicating or stating a preference to store blocks of the same file on the same nodes.
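As a rough sketch of the above (the file paths are just placeholders, and the commands assume a reasonably standard hadoop/hdfs command-line client), you can also set the replication factor per copy or per file instead of cluster-wide:

# Copy a local file into HDFS, overriding dfs.replication for this command only
hadoop fs -D dfs.replication=2 -put /tmp/local.dat /user/me/data.dat

# Change the replication factor of an existing file and wait (-w) until it takes effect
hdfs dfs -setrep -w 3 /user/me/data.dat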
If you are looking for ways to force balancing of data across nodes (with replication at whatever value), a simple option is $HADOOP_HOME/bin/start-balancer.sh, which will run a balancing process to move blocks around the cluster automatically.
This and a few other balancing options can be found in the Hadoop FAQ.
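A minimal sketch of running the balancer (the threshold value here is just an illustrative choice):

# Start the balancer; it moves blocks until no datanode's usage differs
# from the cluster average by more than the threshold (in percent)
$HADOOP_HOME/bin/start-balancer.sh -threshold 5

# Stop the balancer again if needed
$HADOOP_HOME/bin/stop-balancer.sh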
Hope that helps.
You can open the HDFS Web UI on port 50070 of your namenode. It will show you information about the data nodes. One thing you will see there is the used space per node.
If you do not have the UI, you can look at the space used in the HDFS directories of the data nodes.
If you have data skew, you can run the rebalancer, which will resolve it gradually.
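A sketch of checking per-node usage from the command line instead of the Web UI (the local data-directory path below is a placeholder; use whatever dfs.data.dir is set to on your cluster):

# Print a per-datanode report, including DFS Used and DFS Used% for each node
hdfs dfsadmin -report

# Or, on an individual datanode, check the size of its local block storage
du -sh /data/dfs/data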
Now with the Hadoop-385 patch, we can choose the block placement policy, so as to place all blocks of a file on the same node (and similarly for the replica nodes). Read this blog about the topic - look at the comments section.
Yes, Hadoop distributes data per block, so each block would be distributed separately.
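If you want to see this for yourself, one way (the file path is a placeholder) is to ask fsck where each block of a file ended up:

# List every block of the file and the datanodes holding its replicas
hdfs fsck /user/me/data.dat -files -blocks -locations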