Questions about Streaming Job Flows on Amazon EMR
I have to build a fairly complex data processing system using Amazon EC2 + S3 + RDS + EMR, and I have some general questions that I hope you can help me with:
- I need to use R, so I have to use a Streaming Job Flow. Does that mean I lose the power of Hive and can't execute a Hive query on top of the EMR job to work with that data?
- Can I have multiple Job Flows running and interact with them?
- How can I use Dependent Jobs?
- Can a job be re-run once it's done? I don't want to do the calculation just once; I want it to evolve as the data does.
- Can I pass variables to Jobs?
- What is the correct way to automate this?
1 Answer
You can mix jobs in whatever way you want. For example, an R streaming job that reads from S3 and writes to HDFS, followed by a Hive job that reads that data from HDFS and writes it back to S3. They are all just MapReduce jobs.
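This isn't part of the original answer, but here is a minimal sketch of that two-step mix using boto3, the current Python SDK (the answer itself predates it). Bucket names, script paths, and the cluster ID are placeholders, and it assumes R/Rscript is installed on the nodes (e.g. via a bootstrap action).

```python
# Sketch: an R streaming step that writes to HDFS, followed by a Hive step
# that reads it back and writes the result to S3, queued on an existing cluster.
import boto3

emr = boto3.client("emr")

streaming_step = {
    "Name": "R streaming: S3 -> HDFS",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "hadoop-streaming",
            "-files", "s3://my-bucket/scripts/mapper.R,s3://my-bucket/scripts/reducer.R",
            "-mapper", "Rscript mapper.R",
            "-reducer", "Rscript reducer.R",
            "-input", "s3://my-bucket/raw/",
            "-output", "hdfs:///intermediate/",
        ],
    },
}

hive_step = {
    "Name": "Hive: HDFS -> S3",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "hive-script", "--run-hive-script", "--args",
            "-f", "s3://my-bucket/scripts/aggregate.q",
            "-d", "INPUT=hdfs:///intermediate/",
            "-d", "OUTPUT=s3://my-bucket/results/",
        ],
    },
}

emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[streaming_step, hive_step])
```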
There is no limit in EMR on the number of job flows you can have running at once; the only limit enforced is the quota on EC2 instances. There is no support yet for moving data between the HDFS of two clusters, but you can go via S3 easily enough.
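The answer doesn't show the S3 hop itself; one common way to do it is an s3-dist-cp step (the tool ships with EMR). A hedged sketch, with placeholder paths and cluster ID:

```python
# Sketch: copy one cluster's HDFS output to S3 so a second cluster can read it.
import boto3

emr = boto3.client("emr")

handoff_step = {
    "Name": "HDFS -> S3 handoff",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["s3-dist-cp",
                 "--src=hdfs:///intermediate/",
                 "--dest=s3://my-bucket/handoff/"],
    },
}

# Run this on the producing cluster; the second cluster then reads s3://my-bucket/handoff/.
emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[handoff_step])
```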
It depends on what you mean by dependent jobs. You can use the step mechanism to queue jobs to run one after another, so as long as your workflow can be described by a single sequence you're fine. See [1].
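Again not from the answer itself, just a sketch of what that single sequence looks like in boto3: steps queued on one cluster run strictly one after another, ActionOnFailure decides what happens to the rest of the queue, and a waiter can block until a given step finishes. The step bodies and IDs below are placeholders.

```python
# Sketch: "dependent" jobs expressed as a single sequence of steps.
import boto3

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXXX"   # placeholder

steps = [
    {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",   # skip the rest of the queue if this step fails
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }
    for name, args in [
        ("stage-1", ["hadoop-streaming", "-input", "s3://my-bucket/in/",
                     "-output", "hdfs:///stage1/", "-mapper", "cat", "-reducer", "cat"]),
        ("stage-2", ["hadoop-streaming", "-input", "hdfs:///stage1/",
                     "-output", "s3://my-bucket/out/", "-mapper", "cat", "-reducer", "cat"]),
    ]
]

step_ids = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)["StepIds"]

# stage-2 only starts after stage-1 completes; block here until the whole chain is done.
emr.get_waiter("step_complete").wait(ClusterId=cluster_id, StepId=step_ids[-1])
```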
In terms of debugging / exploratory work, it is often easiest to start a cluster with --alive, ssh to the master node, and submit jobs directly. Once you're happy, you can use the step mechanism to orchestrate your workflow.
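The --alive flag belongs to the old elastic-mapreduce CLI; as a hedged sketch, the equivalent knob in boto3 is KeepJobFlowAliveWhenNoSteps. The instance types, key name, release label, and roles below are placeholder assumptions, not anything from the answer.

```python
# Sketch: start a long-lived cluster for ssh debugging (the --alive behaviour).
import boto3

emr = boto3.client("emr")

cluster = emr.run_job_flow(
    Name="exploratory-cluster",
    ReleaseLabel="emr-6.15.0",                 # placeholder release
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "Ec2KeyName": "my-keypair",            # lets you ssh hadoop@<master-public-dns>
        "KeepJobFlowAliveWhenNoSteps": True,   # stay up even with an empty step queue
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print(cluster["JobFlowId"])  # ssh in and experiment, then add steps to this cluster later
```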
Yes; your steps give you full access to the job you're submitting.
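As an illustration the answer doesn't spell out: the "variables" are just arguments on the step, for example -cmdenv for a streaming job (readable in R via Sys.getenv()) or -d for the Hive script runner. Names, dates, and paths below are made up.

```python
# Sketch: parametrise a streaming step by passing variables as step arguments.
import boto3

emr = boto3.client("emr")

parametrised_step = {
    "Name": "daily-run-2024-01-31",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "hadoop-streaming",
            "-files", "s3://my-bucket/scripts/mapper.R,s3://my-bucket/scripts/reducer.R",
            "-cmdenv", "RUN_DATE=2024-01-31",   # visible to the R scripts via Sys.getenv("RUN_DATE")
            "-mapper", "Rscript mapper.R",
            "-reducer", "Rscript reducer.R",
            "-input", "s3://my-bucket/raw/2024-01-31/",
            "-output", "s3://my-bucket/daily/2024-01-31/",
        ],
    },
}

emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[parametrised_step])
```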
As long as your workflow is linear, the step mechanism should be enough: start the cluster, queue up the things to do, make sure the last step outputs to S3, and let the cluster terminate itself.
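To make that concrete, a rough end-to-end sketch (names, roles, and scripts are placeholders, not from the answer): supply the steps at launch, let the last one write to S3, and leave KeepJobFlowAliveWhenNoSteps off so the cluster tears itself down when the queue empties.

```python
# Sketch: a fully automated linear run that terminates itself after the last step.
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="nightly-pipeline",
    ReleaseLabel="emr-6.15.0",                  # placeholder release
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # shut down once the step queue is empty
    },
    Steps=[
        {
            "Name": "final-step-writes-to-s3",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script", "--args",
                         "-f", "s3://my-bucket/scripts/aggregate.q",
                         "-d", "OUTPUT=s3://my-bucket/results/"],
            },
        },
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```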
Mat
[1] http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?ProcessingCycle.html