How do I copy files from S3 to Amazon EMR HDFS?

Posted 2024-12-05 10:20:40

I'm running Hive on EMR and need to copy some files to all EMR instances.

As I understand it, one way is to copy the files to the local file system on each node; the other is to copy them to HDFS. However, I haven't found a simple way to copy straight from S3 to HDFS.

What is the best way to go about this?

Comments (3)

梦里人 2024-12-12 10:20:40

The best way to do this is to use Hadoop's distcp command. Example (on one of the cluster nodes):

% ${HADOOP_HOME}/bin/hadoop distcp s3n://mybucket/myfile /root/myfile

This would copy a file called myfile from an S3 bucket named mybucket to /root/myfile in HDFS. Note that this example assumes you are using the S3 file system in "native" mode; this means that Hadoop sees each object in S3 as a file. If you use S3 in block mode instead, you would replace s3n with s3 in the example above. For more info about the differences between native S3 and block mode, as well as an elaboration on the example above, see http://wiki.apache.org/hadoop/AmazonS3.

I found that distcp is a very powerful tool. In addition to being able to use it to copy a large number of files in and out of S3, you can also perform fast cluster-to-cluster copies with large data sets. Instead of pushing all the data through a single node, distcp uses multiple nodes in parallel to perform the transfer. This makes distcp considerably faster when transferring large amounts of data, compared to the alternative of copying everything to the local file system as an intermediary.
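
As an illustration of that cluster-to-cluster use, a minimal sketch (the namenode hostnames, port, and paths below are placeholders of my own, not from the answer above):

# Copy /data/events from cluster A's HDFS to cluster B's HDFS. distcp runs as a
# MapReduce job, so the transfer is spread across many mapper tasks in parallel.
% ${HADOOP_HOME}/bin/hadoop distcp \
    hdfs://namenode-a:8020/data/events \
    hdfs://namenode-b:8020/data/events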

笑梦风尘 2024-12-12 10:20:40

Now Amazon itself has a wrapper implemented over distcp, namely s3distcp.

S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services (AWS), particularly Amazon Simple Storage Service (Amazon S3). You use S3DistCp by adding it as a step in a job flow. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS, where it can be processed by subsequent steps in your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.

Example: Copy log files from Amazon S3 to HDFS

The following example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example, the --srcPattern option is used to limit the data copied to the daemon logs.

elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
--args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,\
--dest,hdfs:///output,\
--srcPattern,.*daemons.*-hadoop-.*'
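
On newer EMR releases the same copy is usually submitted with the AWS CLI rather than the old elastic-mapreduce tool. A rough sketch, reusing the job flow ID and paths from the example above, with the caveat that the exact step syntax depends on your EMR and CLI versions:

# Add an S3DistCp step to a running cluster via command-runner.jar.
aws emr add-steps --cluster-id j-3GY8JC4179IOJ --steps \
  Type=CUSTOM_JAR,Name=S3DistCpCopy,Jar=command-runner.jar,\
  Args=["s3-dist-cp","--src=s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/","--dest=hdfs:///output","--srcPattern=.*daemons.*-hadoop-.*"]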

狼性发作 2024-12-12 10:20:40

Note that according to Amazon, at http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/FileSystemConfig.html "Amazon Elastic MapReduce - File System Configuration", the S3 Block FileSystem is deprecated, its URI prefix is now s3bfs://, and they specifically discourage using it since "it can trigger a race condition that might cause your job flow to fail".

According to the same page, HDFS is now a 'first-class' file system under S3, although it is ephemeral (it goes away when the Hadoop job ends).
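
Because of that, a common pattern is to distcp results back out of the ephemeral HDFS into S3 before the job flow terminates. A minimal sketch, reusing the native s3n:// scheme and the mybucket and hdfs:///output names from the earlier answers as placeholders:

# Copy the HDFS output directory back to S3 so it survives cluster shutdown.
% ${HADOOP_HOME}/bin/hadoop distcp hdfs:///output s3n://mybucket/output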
