Getting large datasets onto Amazon Elastic MapReduce
There are some large datasets (25gb+, downloadable on the Internet) that I want to play around with using Amazon EMR. Instead of downloading the datasets onto my own computer, and then re-uploading them onto Amazon, what's the best way to get the datasets onto Amazon?
Do I fire up an EC2 instance, download the datasets (using wget) into S3 from within the instance, and then access S3 when I run my EMR jobs? (I haven't used Amazon's cloud infrastructure before, so not sure if what I just said makes any sense.)
3 Answers
I recommend the following...
fire up your EMR cluster
elastic-mapreduce --create --alive --other-options-here
log on to the master node and download the data from there
wget http://blah/data
copy into HDFS
hadoop fs -copyFromLocal data /data
There's no real reason to put the original dataset through S3. If you want to keep the results you can move them into S3 before shutting down your cluster.
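If you do want to keep results, hadoop distcp will copy them from HDFS out to S3 in parallel as a MapReduce job. A minimal sketch, run on the master node and assuming a bucket named my-bucket that the cluster's S3 credentials can write to (both the HDFS path and the bucket are placeholders):

hadoop distcp /results s3n://my-bucket/results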
If the dataset is represented by multiple files you can use the cluster to download it in parallel across the machines. Let me know if this is the case and I'll walk you through it.
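For reference, here is one common way to do that, as a rough sketch rather than a recipe: split the list of URLs into one file per URL so each becomes its own map task, then run a map-only streaming job whose mapper downloads its URL straight into HDFS. The file names, HDFS paths, and streaming jar location below are assumptions and will vary with your Hadoop/EMR version.

# urls.txt: one download URL per line (assumed to exist on the master node)
split -l 1 urls.txt url_part_
hadoop fs -mkdir /url-lists
hadoop fs -copyFromLocal url_part_* /url-lists/
hadoop fs -mkdir /data

# fetch_mapper.sh: reads URLs on stdin and streams each file into HDFS
# (assumes wget and the hadoop CLI are available on the task nodes)
cat > fetch_mapper.sh <<'EOF'
#!/bin/bash
while read url; do
  name=$(basename "$url")
  wget -q -O - "$url" | hadoop fs -put - "/data/$name"
  echo "$url"   # emit the URL so the job output doubles as a fetch log
done
EOF

# map-only streaming job: one mapper per url_part_ file, no reducers
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -input /url-lists \
  -output /fetch-logs \
  -mapper "bash fetch_mapper.sh" \
  -file fetch_mapper.sh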
Mat
If you're just getting started and experimenting with EMR, I'm guessing you want these on s3 so you don't have to start an interactive Hadoop session (and instead use the EMR wizards via the AWS console).
The best way would be to start a micro instance in the same region as your S3 bucket, download to that machine using wget, and then use something like s3cmd (which you'll probably need to install on the instance).
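On Ubuntu, something along these lines should do it (the dataset URL and bucket name are placeholders):

# install and configure s3cmd -- it will prompt for your AWS keys
sudo apt-get install s3cmd
s3cmd --configure

# pull the dataset down, then push it into your bucket
wget http://blah/data
s3cmd put data s3://my-bucket/data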
The reason you'll want your instance and S3 bucket in the same region is to avoid extra data transfer charges. Although you'll be charged for inbound bandwidth to the instance for the wget, the transfer to S3 will be free.
I'm not sure about it, but to me it seems like hadoop should be able to download files directly from your sources.
Just enter http://blah/data as your input, and Hadoop should do the rest. It certainly works with S3, so why shouldn't it work with HTTP?
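For the S3 case at least, the input location is just a parameter. A sketch with the old Ruby elastic-mapreduce CLI, modelled on the classic streaming word-count invocation (the bucket and mapper script are placeholders; 'aggregate' is Hadoop's built-in aggregate reducer):

elastic-mapreduce --create --stream \
  --input s3n://my-bucket/data \
  --output s3n://my-bucket/output \
  --mapper s3n://my-bucket/mapper.py \
  --reducer aggregate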