What is the difference between "distcp" and "distcp -update"?

Posted on 2024-10-10 10:00:43

What is the difference between

hadoop distcp

and

hadoop distcp -update

Both seem to do the same work, with only a slight difference in how they are invoked. Neither of them overwrites a file that already exists at the destination. What, then, is the point of having two separate commands?

Comments (1)

画▽骨i 2024-10-17 10:00:43

The difference between distcp and distcp -update is that plain distcp skips a file that already exists at the destination, whereas distcp -update replaces the destination file when its size differs from the source file's size.

The documentation is a bit confusing on this point, because distcp's default behaviour is to skip a file that already exists, to avoid collisions.

From the docs:

"As noted in the preceding, this is not a "sync" operation. The only criterion examined is the source and destination file sizes; if they differ, the source file replaces the destination file."

Keep in mind that -update is not a delta-transfer algorithm like rsync. It only compares file sizes, so it misses files whose sizes match but whose contents differ.
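To make that limitation concrete, here is a small Python sketch. It is not DistCp code, and the function names are invented for illustration. It contrasts a size-only comparison, the criterion the quoted documentation describes, with a content hash:

```python
import hashlib


def differs_by_size(src: bytes, dst: bytes) -> bool:
    """Size-only comparison (hypothetical helper modelling the documented check)."""
    return len(src) != len(dst)


def differs_by_hash(src: bytes, dst: bytes) -> bool:
    """Content comparison via SHA-256, shown only for contrast."""
    return hashlib.sha256(src).digest() != hashlib.sha256(dst).digest()


# Two made-up file bodies with the same length but different data.
src = b"2024-10-10,ok\n"
dst = b"2024-10-10,no\n"

print("size check sees a difference:", differs_by_size(src, dst))
print("hash check sees a difference:", differs_by_hash(src, dst))
```

In this toy case the size check reports no difference while the hash check does. That is exactly the situation in which a size-only update leaves a stale file in place.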

I should also add that distcp -overwrite overwrites the destination file regardless of whether the sizes match. It is a destructive operation, so make sure that you really want to do this.
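The three behaviours described so far can be summarised as a toy decision function in Python. This is a sketch of the rules as stated in this answer, not DistCp's actual implementation. The mode labels and the `should_copy` name are assumptions made for the example:

```python
from typing import Optional


def should_copy(mode: str, src_size: int, dst_size: Optional[int]) -> bool:
    """Return True if the source file would be copied to the destination.

    mode     -- "default", "update" or "overwrite" (labels invented for this sketch)
    dst_size -- size of the existing destination file, or None if it is missing
    """
    if dst_size is None:
        return True                      # nothing at the destination: always copy
    if mode == "overwrite":
        return True                      # replace unconditionally
    if mode == "update":
        return src_size != dst_size      # replace only when the sizes differ
    if mode == "default":
        return False                     # existing files are skipped
    raise ValueError(f"unknown mode: {mode}")


# Compare the modes for a destination file of equal and of different size.
for mode in ("default", "update", "overwrite"):
    same = should_copy(mode, 100, 100)
    diff = should_copy(mode, 100, 80)
    print(f"{mode:9s} same-size={same} different-size={diff}")
```

Only the "update" row changes between the two columns, which is the whole difference between the two commands in the question.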

Some good examples can be found in the documentation: http://hadoop.apache.org/common/docs/r0.19.2/distcp.html#uo

Here is an example of what I run for a sync operation between two clusters:

hadoop distcp -pugp -i -delete -update hftp://hdfs-nn1:50070/clustera hdfs://hdfs-nn2:9000/clustera

This updates every file on hdfs-nn2 that is missing or whose size does not match the copy on hdfs-nn1, and deletes any extraneous files on hdfs-nn2 that do not exist in the source. If Trash is enabled, deleted files are moved to the .Trash directory of the user invoking distcp.

I would experiment with these options a bit so you can see the effect of each one. Accidentally wiping out terabytes of data is painful, so definitely keep Trash enabled.
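As a rough illustration of what combining -update and -delete amounts to, the following Python sketch plans a sync between two directory listings. It is a simplified model under stated assumptions, not how DistCp works internally. The listings are plain name-to-size dictionaries, and `plan_sync` and the file names are invented for the example:

```python
from typing import Dict, List, Tuple


def plan_sync(src: Dict[str, int], dst: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Return (files to copy, files to delete) for a size-based sync.

    A file is copied when it is missing at the destination or its size differs;
    a file is deleted when it exists only at the destination.
    """
    to_copy = sorted(name for name, size in src.items() if dst.get(name) != size)
    to_delete = sorted(name for name in dst if name not in src)
    return to_copy, to_delete


# Made-up listings: name -> size in bytes.
source = {"a.log": 100, "b.log": 200, "c.log": 300}
target = {"a.log": 100, "b.log": 150, "old.log": 50}

copy, delete = plan_sync(source, target)
print("copy:  ", copy)
print("delete:", delete)
```

Here `b.log` is recopied because its size differs, `c.log` is copied because it is missing, and `old.log` is the extraneous file that would be removed (and end up in Trash, if Trash is enabled).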
