HDFS replication factor

Posted on 2024-12-07 10:50:43

When I upload a file to HDFS and set the replication factor to 1, will the file's splits reside on one single machine, or will the splits be distributed to multiple machines across the network?

hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit
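For context, the `-D dfs.replication=1` flag only affects that one upload; the replication factor of an already-uploaded file can also be changed afterwards. A sketch, reusing the question's example path:

```shell
# Upload with replication factor 1 (per-command override of dfs.replication)
hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit

# Change the replication factor of an already-uploaded file
# (-w waits until the new replication is reached)
hadoop fs -setrep -w 1 /user/ablimit/file.txt

# Confirm the effective replication: it appears in the second
# column of the file listing
hadoop fs -ls /user/ablimit/file.txt
```

These commands require a running cluster, so they are shown for reference only.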

Comments (4)

若能看破又如何 2024-12-14 10:50:43

According to Hadoop: The Definitive Guide:

Hadoop’s default strategy is to place the first replica on the same node as the client (for
clients running outside the cluster, a node is chosen at random, although the system
tries not to pick nodes that are too full or too busy). The second replica is placed on a
different rack from the first (off-rack), chosen at random. The third replica is placed on
the same rack as the second, but on a different node chosen at random. Further replicas
are placed on random nodes on the cluster, although the system tries to avoid placing
too many replicas on the same rack.

This logic makes sense, as it decreases the network chatter between the different nodes. But the book was published in 2009, and there have been a lot of changes in the Hadoop framework since.
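The placement policy quoted above can be sketched roughly as follows. This is an illustration only, not the actual `BlockPlacementPolicyDefault` code; the topology, node names, and rack names are made up:

```python
import random

# Hypothetical cluster topology: rack -> datanodes (illustrative names only)
topology = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}

def place_replicas(client_node, replication):
    """Rough sketch of HDFS's default placement for up to three replicas."""
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}
    all_nodes = [n for nodes in topology.values() for n in nodes]
    replicas = []
    # 1st replica: on the client's own node if it is a datanode,
    # otherwise a random node in the cluster
    first = client_node if client_node in rack_of else random.choice(all_nodes)
    replicas.append(first)
    if replication >= 2:
        # 2nd replica: a random node on a different rack (off-rack)
        other_racks = [r for r in topology if r != rack_of[first]]
        replicas.append(random.choice(topology[random.choice(other_racks)]))
    if replication >= 3:
        # 3rd replica: a different node on the same rack as the 2nd
        second = replicas[1]
        candidates = [n for n in topology[rack_of[second]] if n != second]
        replicas.append(random.choice(candidates))
    return replicas

# Client is a datanode: first replica stays local
print(place_replicas("node1", 3))
```

With replication = 1 and a client that is itself a datanode, the sketch places the only replica on that node, which is exactly the single-machine behavior the question asks about.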

I think it depends on whether the client is itself a Hadoop node or not. If the client is a Hadoop node, then all the splits will be on the same node. This doesn't provide any better read/write throughput, despite there being multiple nodes in the cluster. If the client is not a Hadoop node, then a node is chosen at random for each split, so the splits are spread across the nodes in the cluster, which does provide better read/write throughput.

One advantage of writing to multiple nodes is that even if one of the nodes goes down, only a couple of splits are lost, and at least some data can still be recovered from the remaining splits.

以酷 2024-12-14 10:50:43

If you set replication to 1, then the file will be present only on the client node, that is, the node from which you are uploading the file.
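One way to check where the blocks actually landed (hypothetical path; requires a running cluster):

```shell
# List the blocks of the uploaded file and the datanodes holding them
hdfs fsck /user/ablimit/file.txt -files -blocks -locations
```

With replication set to 1, the `-locations` output should show a single datanode per block.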

我爱人 2024-12-14 10:50:43
  • If your cluster is a single node, then when you upload a file it will be split according to the block size, and it remains on that single machine.
  • If your cluster is multi-node, then when you upload a file it will be split according to the block size and distributed via a pipeline to different datanodes in your cluster; the NameNode decides where the data should be placed in the cluster.

The HDFS replication factor is used to make copies of the data (i.e., if your replication factor is 2, then all the data you upload to HDFS will have one extra copy).
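The splitting and copying described above is simple arithmetic; a minimal sketch, assuming the common 128 MB default block size (your cluster's `dfs.blocksize` may differ):

```python
import math

def num_blocks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

def total_stored_bytes(file_size_bytes, replication):
    """Raw bytes consumed across the cluster for a given replication factor."""
    return file_size_bytes * replication

# A 300 MB file with the default 128 MB block size occupies 3 blocks
print(num_blocks(300 * 1024 * 1024))  # 3
# With replication factor 2, 300 MB of data consumes 2 x 300 MB of raw storage
print(total_stored_bytes(300 * 1024 * 1024, 2))
```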

阳光①夏 2024-12-14 10:50:43

If you set the replication factor to 1, it behaves like a single-node cluster: there is only one copy, on the client node (see http://commandstech.com/replication-factor-in-hadoop/). You can upload files there and then use them on that single (client) node.
