How do I prevent "hadoop fs rmr" from creating $folder$ files?

Posted on 2024-11-03 05:01:14

We're using Amazon's Elastic MapReduce to perform some large file processing jobs. As part of our workflow, we occasionally need to remove files from S3 that may already exist. We do so using the hadoop fs interface, like this:

hadoop fs -rmr s3://mybucket/a/b/myfile.log

This removes the file from S3 as expected, but in its place leaves an empty file named "s3://mybucket/a/b_$folder$". As described in this question, Hadoop's Pig is unable to handle these files, so later steps in the workflow can choke on this file.

(Note, it doesn't seem to matter whether we use -rmr or -rm or whether we use s3:// or s3n:// as the scheme: all of these exhibit the described behavior.)

How do I use the hadoop fs interface to remove files from S3 and be sure not to leave these troublesome files behind?


池予 2024-11-10 05:01:14

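I wasn't able to figure out if it's possible to use the hadoop fs interface in this way. However, the s3cmd interface does the right thing (but only for one key at a time):

s3cmd del s3://mybucket/a/b/myfile.log

This requires first configuring a ~/.s3cfg file with your AWS credentials. s3cmd --configure will interactively help you create this file.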
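If a "_$folder$" marker has already been left behind, one possible cleanup is to delete the object and then the marker key next to it, one key per s3cmd call. The sketch below is only an illustration: the function names folder_marker and s3_del_clean and the S3CMD override variable are hypothetical, and the marker naming is taken from the question (parent path plus "_$folder$"), not from any documented guarantee. Setting S3CMD=echo gives a dry run that prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: derive the "<parent>_$folder$" marker key that the
# question reports seeing next to a deleted object.
folder_marker() {
    parent="${1%/*}"                  # strip the last path component
    printf '%s_$folder$\n' "$parent"  # single quotes keep $folder$ literal
}

# Hypothetical helper: delete the object, then the marker, one key per call.
# S3CMD is an assumed override for dry runs; it defaults to the real s3cmd.
s3_del_clean() {
    "${S3CMD:-s3cmd}" del "$1" &&
        "${S3CMD:-s3cmd}" del "$(folder_marker "$1")"
}

folder_marker "s3://mybucket/a/b/myfile.log"
# prints: s3://mybucket/a/b_$folder$
```

With ~/.s3cfg configured, calling s3_del_clean "s3://mybucket/a/b/myfile.log" would issue the two s3cmd del commands in turn. Whether the marker exists at that point depends on how the file was created and removed, so the second delete may have nothing to remove.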
