What is the difference between "distcp" and "distcp -update"?
What is the difference between
hadoop distcp
and
hadoop distcp -update
Both seem to do the same work, with only a slight difference in how they are invoked. Neither overwrites a file that already exists at the destination. What is the point, then, of having two different commands?
Comments (1)
The difference between distcp and distcp -update is that distcp by default skips files that already exist at the destination, while "distcp -update" replaces a file if the src size differs from the dst size.
The documentation is a bit confusing on this point, since the default behavior of distcp is to skip a file that already exists in order to prevent collisions.
From the docs:
"As noted in the preceding, this is not a "sync" operation. The only criterion examined is the source and destination file sizes; if they differ, the source file replaces the destination file."
Keep in mind
-update
is not a delta-transfer algorithm like rsync. It only does a size check, which isn't perfect when files are the same size but contain different data. I should also elaborate a bit and explain that
distcp -overwrite
will overwrite the file regardless of whether the sizes match. It's a destructive process, so make sure that you really want to do this. Some great examples can be found here: http://hadoop.apache.org/common/docs/r0.19.2/distcp.html#uo
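To make the three behaviors concrete, here is a minimal sketch that models the per-file decision on local files. It is not DistCp code: `would_copy` is a hypothetical helper, and the rule it encodes is only the size check quoted from the r0.19.2 docs above. Other DistCp versions may apply additional checks, so treat it as an illustration and consult the docs for your release.

```shell
#!/bin/sh
# Local-file model of the per-file decision described above.
# NOT DistCp code; would_copy is a hypothetical helper for illustration.
would_copy() {
    mode=$1
    src=$2
    dst=$3
    # A file that is missing at the destination is copied in every mode.
    if [ ! -e "$dst" ]; then
        echo copy
        return
    fi
    case $mode in
        default)   echo skip ;;   # existing files are left alone
        overwrite) echo copy ;;   # existing files are always replaced
        update)
            # Only sizes are compared, never contents.
            src_size=$(($(wc -c < "$src")))
            dst_size=$(($(wc -c < "$dst")))
            if [ "$src_size" -ne "$dst_size" ]; then
                echo copy
            else
                echo skip
            fi ;;
    esac
}

tmp=$(mktemp -d)
printf 'aaaa' > "$tmp/src"
printf 'bbbb' > "$tmp/same_size"   # same size, different data
printf 'bb'   > "$tmp/smaller"     # different size

for mode in default update overwrite; do
    for dst in same_size smaller missing; do
        echo "$mode $dst: $(would_copy "$mode" "$tmp/src" "$tmp/$dst")"
    done
done
rm -rf "$tmp"
```

The `same_size` case is the weak spot mentioned above: in this model `update` skips it even though the contents differ.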
I also want to give an example of what I do in a sync operation between two clusters:
This will update all files in hdfs-nn2 whose size does not match the copy on hdfs-nn1, as well as delete any extraneous files. If .Trash is in use, any deleted files are placed in the Trash of the user invoking distcp.
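The command for this sync is not shown above. One invocation that matches the description might look like the sketch below. The port and paths are placeholders I made up, not the author's, and the `-update -delete` combination is inferred from the text rather than quoted from it. The `run` wrapper is also my own addition: it only prints the command unless `DRY_RUN=0` is set, which is a cheap guard while experimenting.

```shell
#!/bin/sh
# Hypothetical endpoints and path; adjust to your own clusters.
SRC="hdfs://hdfs-nn1:8020/user/example/data"
DST="hdfs://hdfs-nn2:8020/user/example/data"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# -update copies files that are missing or differ at the destination;
# -delete removes destination files that do not exist at the source.
run hadoop distcp -update -delete "$SRC" "$DST"
```

Running it once against a small test directory, then checking both the destination and the Trash, is a low-risk way to confirm what `-delete` removes before pointing it at real data.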
I would experiment with it a bit so you can see the effect of the various commands. It can be painful when you accidentally wipe out TBs of data, so definitely use your Trash.