HDFS replication factor
When I upload a file to HDFS with the replication factor set to 1, will the file's splits reside on a single machine, or will they be distributed across multiple machines on the network?
hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit
According to Hadoop: The Definitive Guide, the first replica of each block is placed on the node where the client is running, when that node is part of the cluster.
This logic makes sense because it reduces network chatter between nodes. However, the book was published in 2009, and the Hadoop framework has changed a lot since then.
I think it depends on whether the client is itself a Hadoop node. If it is, all the splits will end up on that node, which gives no better read/write throughput despite the cluster having multiple nodes. If it is not, a node is chosen at random for each split, so the splits are spread across the nodes in the cluster, and that does give better read/write throughput.
One advantage of writing to multiple nodes is that if one node goes down, only the splits on that node are lost, and the data in the remaining splits can still be recovered.
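To make the two cases above concrete, here is a rough sketch that simulates where the blocks of a file would land with replication set to 1. It is a toy model, not HDFS code: the function `place_blocks` and the node names are made up for illustration, and real HDFS also tries to avoid nodes that are too full or too busy, which this ignores.

```python
import random
from collections import Counter

def place_blocks(num_blocks, datanodes, client_node=None, seed=None):
    """Return the DataNode chosen for each block when replication is 1.

    Simplified model (an assumption, not the real placement code):
    if the client runs on a DataNode, every block lands on that node;
    otherwise each block goes to a randomly chosen DataNode.
    """
    rng = random.Random(seed)
    if client_node in datanodes:
        return [client_node] * num_blocks
    return [rng.choice(datanodes) for _ in range(num_blocks)]

datanodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical node names

# Client is itself a DataNode: all blocks stay on that node.
local = place_blocks(8, datanodes, client_node="dn2")

# Client is outside the cluster: each block gets a random node.
remote = place_blocks(8, datanodes, client_node="edge-host", seed=0)

print("client on dn2:     ", Counter(local))
print("client off-cluster:", Counter(remote))
```

In the first case every block count ends up under a single node; in the second the counts are typically spread over several nodes. On a real cluster, the actual placement can be checked with `hdfs fsck <path> -files -blocks -locations`.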
If you set replication to 1, the file will be present only on the client node, that is, the node from which you upload the file.
The HDFS replication factor controls how many copies of the data are kept. For example, with a replication factor of 2, all the data you upload to HDFS is stored twice: the original plus one copy.
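As a quick back-of-the-envelope check of what the replication factor costs in storage, something like the following could be used. The helper `hdfs_footprint` is hypothetical, and the 128 MiB block size is an assumption (it is the usual default for `dfs.blocksize` in recent Hadoop versions, but your cluster may differ).

```python
import math

def hdfs_footprint(file_size_bytes, replication, block_size_bytes=128 * 1024 * 1024):
    """Return (blocks, block replicas stored, raw bytes stored) for one file.

    Assumes the last block is not padded, so raw storage is simply
    file size times the replication factor.
    """
    blocks = math.ceil(file_size_bytes / block_size_bytes)
    return blocks, blocks * replication, file_size_bytes * replication

one_gib = 1024 ** 3
for r in (1, 2, 3):
    blocks, replicas, raw = hdfs_footprint(one_gib, r)
    print(f"replication={r}: {blocks} blocks, {replicas} block replicas, "
          f"{raw / one_gib:.0f} GiB raw")
```

With replication 1 there is exactly one stored replica per block, so losing the node that holds a block means losing that block.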
Setting the replication factor to 1 is what you would do on a single-node cluster, which has only one client node (see http://commandstech.com/replication-factor-in-hadoop/). You upload files there and then use them on that single node or client node.