Chaining multiple MapReduce jobs with Hadoop Streaming

I am in a scenario where I have two MapReduce jobs. I am more comfortable with Python and am planning to use it to write the MapReduce scripts, using Hadoop Streaming for the same. Is there a convenient way to chain the two jobs in the following form when Hadoop Streaming is used?

Map1 -> Reduce1 -> Map2 -> Reduce2

I've heard of a lot of methods to accomplish this in Java, but I need something for Hadoop Streaming.

Comments (4)

甩你一脸翔 2024-10-18 01:17:18

Here is a great blog post on how to use Cascading and Streaming.
http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/

The value here is that you can mix Java (Cascading query flows) with your custom streaming operations in the same app. I find this much less brittle than other methods.

Note that the Cascade object in Cascading allows you to chain multiple Flows (per the above blog post, your streaming job would become a MapReduceFlow).

Disclaimer: I'm the author of Cascading

孤檠 2024-10-18 01:17:18

You can try out Yelp's MRJob to get your job done. It's an open-source MapReduce library that allows you to write chained jobs that can be run atop Hadoop Streaming on your Hadoop cluster or on EC2. It's pretty elegant and easy to use, and has a method called steps which you can override to specify the exact chain of mappers and reducers that you want your data to go through.

Check out the source at https://github.com/Yelp/mrjob
and the documentation at http://packages.python.org/mrjob/
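
The steps override this answer mentions is the key piece for chaining. As a minimal sketch (assuming a recent mrjob version that provides MRStep; the word-count-style mappers and reducers below are purely illustrative, not from the answer), a two-pass job could look like this:

```python
from mrjob.job import MRJob
from mrjob.step import MRStep


class TwoStepJob(MRJob):
    """Chains Map1 -> Reduce1 -> Map2 -> Reduce2 inside one job class."""

    def steps(self):
        # Each MRStep runs as its own MapReduce pass; the output of the
        # first step's reducer feeds the second step's mapper.
        return [
            MRStep(mapper=self.mapper_1, reducer=self.reducer_1),
            MRStep(mapper=self.mapper_2, reducer=self.reducer_2),
        ]

    def mapper_1(self, _, line):
        # Illustrative first pass: emit (word, 1) for every token.
        for word in line.split():
            yield word, 1

    def reducer_1(self, word, counts):
        yield word, sum(counts)

    def mapper_2(self, word, count):
        # Illustrative second pass: re-key by count so words can be
        # grouped by frequency.
        yield count, word

    def reducer_2(self, count, words):
        yield count, list(words)


if __name__ == "__main__":
    TwoStepJob.run()
```

You would launch it against a cluster with something like `python two_step_job.py -r hadoop <input>`; mrjob takes care of wiring the first step's output into the second step.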

最后的乘客 2024-10-18 01:17:18

Typically the way I do this with Hadoop Streaming and Python is from within the bash script that I create to run the jobs in the first place. I always run from a bash script; this way I can get emails on errors and emails on success, and it is more flexible to pass in parameters from another Ruby or Python script wrapping it, which can work in a larger event-processing system.

So the output of the first command (job) becomes the input to the next command (job), and these paths can be variables in your script, passed in as arguments from the command line (simple and quick).
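
A minimal sketch of that pattern, written here as a small Python wrapper rather than bash (the answer also mentions wrapping the jobs from Python). The streaming jar path and the map1.py/reduce1.py script names are assumptions you would replace with your own:

```python
#!/usr/bin/env python
"""Hypothetical wrapper that chains two Hadoop Streaming jobs:
the output directory of job 1 is used as the input of job 2."""
import subprocess
import sys

# Assumed location; adjust to wherever your distribution ships the jar.
STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"


def run_streaming_job(input_path, output_path, mapper, reducer):
    """Submit one streaming job and fail loudly if it does not succeed."""
    cmd = [
        "hadoop", "jar", STREAMING_JAR,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
        "-file", mapper,
        "-file", reducer,
    ]
    subprocess.check_call(cmd)


if __name__ == "__main__":
    raw_dir, intermediate_dir, final_dir = sys.argv[1:4]
    # Job 1: Map1 -> Reduce1
    run_streaming_job(raw_dir, intermediate_dir, "map1.py", "reduce1.py")
    # Job 2: Map2 -> Reduce2, reading what job 1 wrote
    run_streaming_job(intermediate_dir, final_dir, "map2.py", "reduce2.py")
```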

You might want to check out Oozie (http://yahoo.github.com/oozie/design.html), a workflow engine for Hadoop that will help to do this as well (it supports streaming, no problem). I did not have this when I started, so I ended up having to build my own thing, but it is a cool system and useful!

撧情箌佬 2024-10-18 01:17:18

If you are already writing your mapper and reducer in Python, I would consider using Dumbo, where such an operation is straightforward. The sequence of your MapReduce jobs, your mappers, reducers, etc. are all in one Python script that can be run from the command line.
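
A rough sketch of what that single script might look like, based on Dumbo's multi-iteration pattern (additer chains one MapReduce pass after another); the mapper and reducer bodies here are illustrative placeholders, not from the answer:

```python
import dumbo


def mapper1(key, value):
    # First pass: tokenize each input line into (word, 1) pairs.
    for word in value.split():
        yield word, 1


def reducer1(key, values):
    yield key, sum(values)


def mapper2(key, value):
    # Second pass: re-key by count so words can be grouped by frequency.
    yield value, key


def reducer2(key, values):
    yield key, list(values)


def runner(job):
    # Each additer call adds one MapReduce iteration; Dumbo feeds the
    # output of the first iteration into the second.
    job.additer(mapper1, reducer1)
    job.additer(mapper2, reducer2)


if __name__ == "__main__":
    dumbo.main(runner)
```

You would then launch the whole chain with Dumbo's command-line tool (e.g. `dumbo start chained.py -hadoop <hadoop home> -input <in> -output <out>`, with paths as placeholders), and it runs both iterations back to back.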
