Launch an Amazon Elastic MapReduce job remotely?

Posted 2024-09-15 16:27:37

I'm working on a small project to get myself acquainted with the Amazon web services. I'm trying to make a simple web application; when a button is pressed a mapreduce job is launched and the output is returned on the browser.
What would be the best way to do this? Also, is there a way to launch an amazon elastic mapreduce job via the command line?
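On the command-line part of the question: at the time, Amazon shipped a Ruby command-line client, `elastic-mapreduce`, that could start a streaming job flow directly. The invocation below is a sketch using Amazon's public wordcount sample; `<my output bucket>` is a placeholder, and the exact flags should be checked against the client's own `--help`. (The modern equivalent is the `aws emr create-cluster` subcommand of the AWS CLI.)

```
# Sketch: launching a streaming job flow with Amazon's elastic-mapreduce
# CLI (Ruby client). Replace <my output bucket> with a real bucket name.
elastic-mapreduce --create --stream \
  --mapper  s3n://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --reducer aggregate \
  --input   s3n://elasticmapreduce/samples/wordcount/input \
  --output  s3n://<my output bucket>/output/wordcount_output
```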

Comments (2)

过气美图社 2024-09-22 16:27:37

You can use the AWS SDK for whatever language you're writing your web application in to make calls to EMR to submit a job. I work mostly with Python, so I'm most familiar with the Boto library, which makes it pretty painless to upload code and data to S3, configure a job flow, and launch it.

You won't want to launch the job and return the results in the same HTTP request, as it takes several minutes just to start the cluster before the job can run. A web application whose pages don't respond for minutes isn't a good user experience. Submitting a job flow, however, only takes a few seconds. Create the job flow and keep track of the job flow IDs in your web application; given a job flow ID, you shouldn't have much trouble retrieving log data or output from the job flow when the user comes back and the job is complete.

Here's an example of how one could launch an Elastic MR job with Boto:

import boto
from boto.emr.step import StreamingStep

# Connect to EMR using credentials from the boto config / environment.
conn = boto.connect_emr()

# A streaming step: the mapper script and input come from Amazon's public
# wordcount sample; 'aggregate' is Hadoop's built-in aggregate reducer.
step = StreamingStep(name='My wordcount example',
                     mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
                     reducer='aggregate',
                     input='s3n://elasticmapreduce/samples/wordcount/input',
                     output='s3n://<my output bucket>/output/wordcount_output')

# run_jobflow provisions the cluster, runs the step, and returns the
# job flow ID you'll use to track the job later.
jobid = conn.run_jobflow(name='My jobflow',
                         log_uri='s3://<my log uri>/jobflow_logs',
                         steps=[step])
情痴 2024-09-22 16:27:37

Did you give this a look yet? http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873 It's from the dev side and might help you along.
