One distcp command to upload multiple files to S3 (no directories)

Posted 2025-01-09 09:54:44


I am currently working with the s3a adapter of Hadoop/HDFS to allow me to upload a number of files from a Hive database to a particular s3 bucket. I'm getting nervous because I can't find anything online about specifying a bunch of filepaths (not directories) for copy via distcp.

I have set up my program to collect an array of filepaths using a function, inject them all into a distcp command, and then run the command:

files = self.get_files_for_upload()
if not files:
    logger.warning("No recently updated files found. Exiting...")
    return

full_path_files = [f"hdfs://nameservice1{file}" for file in files]
s3_dest = "path/to/bucket"
cmd = f"hadoop distcp -update {' '.join(full_path_files)} s3a://{s3_dest}"

logger.info(f"Preparing to upload Hive data files with cmd: \n{cmd}")
result = subprocess.run(cmd, shell=True, check=True)

This basically just creates one long distcp command with 15-20 different filepaths. Will this work? Should I be using the -cp or -put commands instead of distcp?
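
For illustration, with two hypothetical HDFS paths the generated command string would look roughly like this:

hadoop distcp -update hdfs://nameservice1/warehouse/db/table/part-00000 hdfs://nameservice1/warehouse/db/table/part-00001 s3a://path/to/bucket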

(It doesn't make sense to me to copy all these files to their own directory and then distcp that entire directory, when I can just copy them directly and skip those steps...)


1 Answer

萌化 2025-01-16 09:54:44


-cp and -put would require you to download the HDFS files, then upload to S3. That would be a lot slower.
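
For comparison, a minimal sketch of that per-file route (reusing the question's files and s3_dest, with the same placeholder bucket path) might look like the following; each copy streams through a single client process rather than running as a distributed job:

import subprocess

# Copy each file individually with the HDFS shell; every invocation pipes the
# whole file through one client process, so 15-20 files copy one after another.
for file in files:
    subprocess.run(
        ["hdfs", "dfs", "-cp", f"hdfs://nameservice1{file}", f"s3a://{s3_dest}/"],
        check=True,
    )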

I see no immediate reason why this wouldn't work. However, after reading over the documentation, I would recommend using the -f flag instead.

E.g.

files = self.get_files_for_upload()
if not files:
    logger.warning("No recently updated files found. Exiting...")
    return

# Write one fully qualified source URI per line; distcp's -f flag reads its
# source list from this file. Note that distcp resolves the list path against
# the default filesystem, so it may need a file:// prefix or to live on HDFS.
src_file = 'to_copy.txt'
with open(src_file, 'w') as f:
    for file in files:
        f.write(f'hdfs://nameservice1{file}\n')

s3_dest = "path/to/bucket"
# Pass the command as an argument list without shell=True; combining a list
# with shell=True would hand the extra items to the shell, not to hadoop.
result = subprocess.run(
    ['hadoop', 'distcp', '-f', src_file, f's3a://{s3_dest}'],
    check=True,
)

If all the files were already in their own directory, then you should just copy that directory, like you said.
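
That directory form would reduce to a single source URI, roughly:

hadoop distcp -update hdfs://nameservice1/path/to/dir s3a://path/to/bucket

(where /path/to/dir stands in for the common parent directory of the files).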
