
Getting large datasets onto Amazon Elastic MapReduce

Posted on 2024-11-03 09:57:16

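I want to use Amazon EMR to process some large datasets (25GB and up, downloadable from the Internet). Rather than downloading the datasets to my own computer and then re-uploading them to Amazon, what is the best way to get them onto Amazon?

Should I launch an EC2 instance, download the datasets from within the instance (using wget), put them into S3, and then read from S3 when running my EMR jobs? (I have not used Amazon's cloud infrastructure before, so I'm not sure whether what I just said makes sense.)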


我ぃ本無心為│何有愛 2024-11-10 09:57:16

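I recommend doing the following:

Start your EMR cluster:
elastic-mapreduce --create --alive --other-options-here
Log in to the master node and download the data from there:
wget http://blah/data
Copy it into HDFS:
hadoop fs -copyFromLocal data /data

There is no real reason to put the original dataset through S3. If you want to keep the results, you can move them to S3 before shutting down your cluster.

If the dataset is split across multiple files, you can use the cluster to download them in parallel across the machines. Let me know if that is the case and I will walk you through it.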
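Moving results to S3 before shutting down might look like the sketch below. The function name save_results, the HDFS path /output and the bucket mybucket are all hypothetical, and the sketch assumes the cluster's Hadoop can write to s3:// URIs, as EMR clusters are normally set up to do.

```shell
# Sketch: copy job results from HDFS to S3 before terminating the cluster.
# save_results, /output and mybucket are hypothetical names; use your own.
save_results() {
    src="$1"     # HDFS directory holding the job output, e.g. /output
    dest="$2"    # S3 destination, e.g. s3://mybucket/results/
    # Stop early if the source directory does not exist in HDFS.
    hadoop fs -test -d "$src" || {
        echo "no such HDFS directory: $src" >&2
        return 1
    }
    # distcp runs the copy as a MapReduce job, so large outputs
    # are copied by several nodes at once.
    hadoop distcp "$src" "$dest"
}
```

On the master node this would be called as: save_results /output s3://mybucket/results/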
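The parallel process was never posted in this thread, so the following is only one possible building block, not the author's method. It assumes the dataset is described by a plain list of URLs, one per line, and that each node is handed its own slice of that list (for example as the mapper of a map-only Hadoop Streaming job, which is not shown here). The function name fetch_urls is hypothetical.

```shell
# fetch_urls: read one URL per line on stdin, download each file,
# and store it under the given HDFS directory.
# Prints "<url><TAB>ok" or "<url><TAB>failed" for every URL processed.
fetch_urls() {
    dest_dir="$1"    # HDFS target directory, e.g. /data
    while IFS= read -r url; do
        [ -n "$url" ] || continue    # skip blank lines
        name=$(basename "$url")      # file name taken from the URL path
        tmp=$(mktemp)                # local staging file for this download
        # </dev/null keeps the commands from swallowing the URL list on stdin.
        if wget -q -O "$tmp" "$url" </dev/null &&
           hadoop fs -put "$tmp" "$dest_dir/$name" </dev/null; then
            printf '%s\tok\n' "$url"
        else
            printf '%s\tfailed\n' "$url"
        fi
        rm -f "$tmp"                 # free local disk before the next file
    done
}
```

On a single node it could be tried with: fetch_urls /data < urls.txt (urls.txt being a hypothetical file of URLs). Staging each file locally first means a failed download is reported instead of leaving an empty file in HDFS.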

Matt


美人如玉 2024-11-10 09:57:16

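If you are just getting started and experimenting with EMR, I'm guessing you want the data on S3 so that you don't have to start an interactive Hadoop session (and can instead use the EMR wizards in the AWS console).

The best way is to start a micro instance in the same region as your S3 bucket, download to that machine using wget, and then upload with something like s3cmd (which you will probably need to install on the instance). On Ubuntu:

wget -O dataset http://example.com/mydataset
sudo apt-get install s3cmd
s3cmd --configure
s3cmd put dataset s3://mybucket/

The reason you want your instance and S3 bucket in the same region is to avoid extra data transfer charges. You may be charged for inbound bandwidth to the instance for the wget download (check current AWS pricing), but the transfer from the instance to S3 within the same region is free.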
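The commands above can be wrapped into one function that stops on the first failure and frees the instance's disk afterwards. This is a minimal sketch: the name upload_dataset is hypothetical, the URL and bucket are the placeholders from the answer, and it assumes s3cmd is already installed and configured. Note that S3 limits a single PUT to 5GB, so a 25GB file depends on multipart upload, which recent s3cmd versions perform automatically.

```shell
# Sketch: download a dataset on an EC2 instance and push it to S3 with s3cmd.
# upload_dataset is a hypothetical helper; s3cmd must already be configured.
upload_dataset() {
    url="$1"       # where the dataset lives on the Internet
    bucket="$2"    # target bucket name, without the s3:// prefix
    file=$(basename "$url")
    # Download to the current directory; give up if the download fails.
    wget -O "$file" "$url" || return 1
    # Upload, then list the object so the result can be checked by eye.
    s3cmd put "$file" "s3://$bucket/$file" || return 1
    s3cmd ls "s3://$bucket/$file" || return 1
    # Remove the local copy only after the upload has succeeded.
    rm -f "$file"
}
```

It would be called as: upload_dataset http://example.com/mydataset mybucket. The instance needs enough free disk space to hold the whole dataset while it is being uploaded.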

