Launch an Amazon Elastic MapReduce job remotely?

Posted 2024-09-15 16:27:37

I'm working on a small project to get myself acquainted with the Amazon web services. I'm trying to make a simple web application; when a button is pressed a mapreduce job is launched and the output is returned on the browser.
What would be the best way to do this? Also, is there a way to launch an amazon elastic mapreduce job via the command line?
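On the command-line part of the question: at the time, Amazon shipped a Ruby command-line client, `elastic-mapreduce`, that could start a streaming job flow directly. The invocation below is a sketch using Amazon's public wordcount sample; `<my output bucket>` is a placeholder, and the exact flags should be checked against the client's own `--help`. (The modern equivalent is the `aws emr create-cluster` subcommand of the AWS CLI.)

```
# Sketch: launching a streaming job flow with Amazon's elastic-mapreduce
# CLI (Ruby client). Replace <my output bucket> with a real bucket name.
elastic-mapreduce --create --stream \
  --mapper  s3n://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --reducer aggregate \
  --input   s3n://elasticmapreduce/samples/wordcount/input \
  --output  s3n://<my output bucket>/output/wordcount_output
```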

Comments (2)

过气美图社 2024-09-22 16:27:37

You can use the AWS SDK for whatever language you're writing your web application in to make calls to EMR to submit a job. I work mostly with Python, so I'm most familiar with the Boto library, which makes it pretty painless to upload code and data to S3, configure a job flow, and launch it.

You won't want to launch the job and return the results in the same HTTP request, as it takes several minutes just to start the cluster before the job can run. A web application whose pages don't respond for minutes isn't a good user experience. Submitting a job flow, however, only takes a few seconds. Create the job flow and keep track of the job flow IDs in your web application; given a job flow ID, you shouldn't have much trouble retrieving log data or output from the job flow when the user comes back and the job is complete.

Here's an example of how one could launch an Elastic MR job with Boto:

import boto
from boto.emr.step import StreamingStep

# Connect to EMR using credentials from the boto config / environment.
conn = boto.connect_emr()

# A streaming step: the mapper script and input come from Amazon's public
# wordcount sample; 'aggregate' is Hadoop's built-in aggregate reducer.
step = StreamingStep(name='My wordcount example',
                     mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
                     reducer='aggregate',
                     input='s3n://elasticmapreduce/samples/wordcount/input',
                     output='s3n://<my output bucket>/output/wordcount_output')

# run_jobflow provisions the cluster, runs the step, and returns the
# job flow ID you'll use to track the job later.
jobid = conn.run_jobflow(name='My jobflow',
                         log_uri='s3://<my log uri>/jobflow_logs',
                         steps=[step])
情痴 2024-09-22 16:27:37

Did you give this a look yet? http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873 It's from the dev side and might help you along.
