Getting data into and out of Elastic MapReduce HDFS
I've written a Hadoop program which requires a certain layout within HDFS, and afterwards I need to get the files back out of HDFS. It works on my single-node Hadoop setup, and I'm eager to get it working on tens of nodes within Elastic MapReduce.
What I've been doing is something like this:
./elastic-mapreduce --create --alive
JOBID="j-XXX" # output from creation
./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp s3://bucket-id/XXX /XXX"
./elastic-mapreduce -j $JOBID --jar s3://bucket-id/jars/hdeploy.jar --main-class com.ranjan.HadoopMain --arg /XXX
This is asynchronous, but when the job's completed, I can do this:
./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp /XXX s3://bucket-id/XXX-output"
./elastic-mapreduce -j $JOBID --terminate
So while this sort of works, it's clunky and not what I'd like. Is there a cleaner way to do this?
Thanks!
2 Answers
You can use distcp, which will copy the files as a MapReduce job. This makes use of your entire cluster to copy in parallel from S3.
(Note: the trailing slashes on each path are important for copying from directory to directory.)
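For example, a minimal sketch reusing the $JOBID, bucket, and paths from the question (whether the URI scheme should be s3:// or s3n:// depends on the Hadoop version EMR gives you):

./elastic-mapreduce -j $JOBID --ssh "hadoop distcp s3://bucket-id/XXX/ /XXX/"          # S3 -> HDFS before running your jar
./elastic-mapreduce -j $JOBID --ssh "hadoop distcp /XXX/ s3://bucket-id/XXX-output/"   # HDFS -> S3 after your job finishes

Because each distcp invocation is itself a MapReduce job, the transfer is spread across all the nodes in the cluster instead of funnelling through a single machine.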
@mat-kelcey, does the distcp command expect the files in S3 to have a minimum permission level? For some reason I have to set the permission level of the files to "Open/Download" and "View Permissions" for "Everyone" for the files to be accessible from within the bootstrap or step scripts.