How to prepare data for AWS MapReduce and post-process the results
I am working with Amazon's MapReduce Web Service for a university project. In order to use the data for MapReduce, I need to dump it from a relational database (AWS RDS) into S3. After MapReduce finishes, I need to split the output file and load its chunks into their own S3 buckets.
What is a good way to do this within the Amazon Web Services environment?
Best case: could this be accomplished without using extra EC2 instances besides the ones used for RDS and MapReduce?
I use Python for the mapper and reducer functions and JSON specifiers for the MapReduce job flow. Otherwise I am not bound to any particular language or technology.
1 Answer
If you take a look at the Amazon Elastic MapReduce Developer Guide, you will see that you need to specify the locations of the input data, the output data, the mapper script and the reducer script in S3 in order to create a MapReduce job flow.
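For illustration, those four S3 locations can also be set programmatically with the boto library mentioned further down. This is only a minimal sketch assuming boto's EMR module; the bucket names, script paths and job flow name are placeholders, and it assumes your AWS credentials are already available to boto:

```python
# Minimal sketch: creating a streaming job flow with boto.
# All S3 paths and names below are placeholders.
from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

emr = EmrConnection()  # picks up AWS credentials from the environment / boto config

step = StreamingStep(
    name='my streaming step',
    mapper='s3n://my-bucket/scripts/mapper.py',
    reducer='s3n://my-bucket/scripts/reducer.py',
    input='s3n://my-bucket/input/',
    output='s3n://my-bucket/output/')

jobflow_id = emr.run_jobflow(
    name='university project job flow',
    log_uri='s3n://my-bucket/logs/',
    steps=[step])
print(jobflow_id)  # something like 'j-...'
```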
If you need to do some pre-processing (such as dumping the MapReduce input file from a database) or post-processing (such as splitting the MapReduce output file into other locations in S3), you will have to automate those tasks separately from the MapReduce job flow.
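As a concrete sketch of the pre-processing step, assuming the RDS instance runs MySQL and that the MySQLdb and boto packages are available, the dump could look roughly like this; the endpoint, credentials, table, file and bucket names are all placeholders:

```python
# Sketch: dump a table from RDS (assumed MySQL here) into a tab-separated
# file and upload it to S3 as MapReduce input. Names are placeholders.
import MySQLdb
import boto
from boto.s3.key import Key

db = MySQLdb.connect(host='mydb.xxxxxxxx.rds.amazonaws.com',  # placeholder RDS endpoint
                     user='myuser', passwd='mypassword', db='mydb')
cursor = db.cursor()
cursor.execute('SELECT id, payload FROM records')  # placeholder query

with open('input.tsv', 'w') as f:
    for row in cursor:
        f.write('\t'.join(str(col) for col in row) + '\n')

s3 = boto.connect_s3()
bucket = s3.get_bucket('my-bucket')
key = Key(bucket, 'input/input.tsv')
key.set_contents_from_filename('input.tsv')
```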
You may use the boto library to write those pre-processing and post-processing scripts. They can be run on an EC2 instance or on any other computer with access to the S3 bucket. Data transfer from EC2 may be cheaper and faster, but if you don't have an EC2 instance available for this, you could run the scripts on your own computer... unless there is too much data to transfer!
You can go as far as you want with automation: you may even orchestrate the whole process of generating the input, launching a new MapReduce job flow, waiting for the job to finish and processing the output accordingly, so that, given the proper configuration, the whole thing is reduced to pushing a button :)
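To make that "push a button" scenario a bit more concrete, here is a rough sketch of the post-processing and orchestration part with boto: it polls the job flow until it terminates, then splits one output file into fixed-size chunks and uploads each chunk to its own bucket. The job flow id, bucket and key names and the chunk size are placeholders, and error handling is left out:

```python
# Rough sketch: wait for the job flow to finish, then split the output
# and upload each chunk to its own S3 bucket. Names/sizes are placeholders.
import time

import boto
from boto.emr.connection import EmrConnection
from boto.s3.key import Key

jobflow_id = 'j-XXXXXXXXXXXX'  # placeholder: the id returned by run_jobflow

emr = EmrConnection()
while emr.describe_jobflow(jobflow_id).state not in ('COMPLETED', 'FAILED', 'TERMINATED'):
    time.sleep(30)  # poll every 30 seconds

s3 = boto.connect_s3()
output_key = s3.get_bucket('my-bucket').get_key('output/part-00000')
lines = output_key.get_contents_as_string().splitlines()  # fine for small outputs

chunk_size = 1000  # lines per chunk
for n, start in enumerate(range(0, len(lines), chunk_size)):
    # S3 bucket names must be globally unique; these are placeholders.
    chunk_bucket = s3.create_bucket('my-output-chunk-%d' % n)
    key = Key(chunk_bucket, 'chunk-%d.txt' % n)
    key.set_contents_from_string('\n'.join(lines[start:start + chunk_size]))
```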