Running Hadoop MapReduce: is it possible to call external executables outside of HDFS?
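Within my mapper, I'd like to call external software that is installed on the worker nodes, outside of HDFS. Is this possible? What is the best way to do it?
I understand that this may take away some of the advantages and scalability of MapReduce, but I'd like to both work with data in HDFS and call compiled/installed external software from within my mapper to process some of that data.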
Comments (2)
Mappers (and reducers) are like any other process on the box: as long as the TaskTracker user has permission to run the executable, there is no problem doing so. There are a few ways to call external processes, but since we are already in Java, ProcessBuilder seems a logical place to start.
EDIT: Just found that Hadoop has a class explicitly for this purpose: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Shell.html
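As a rough illustration of the ProcessBuilder route, here is a minimal sketch of a helper that a mapper's `map()` method could call for each record. The class name `ExternalToolRunner` and the `run` method are hypothetical, and `tr` merely stands in for whatever executable is actually installed on the worker nodes. It uses only the standard Java library (Java 9 or later), not Hadoop's own `Shell` class.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Hadoop: runs an external command,
// feeds it text on stdin, and returns what it printed.
final class ExternalToolRunner {

    static String run(List<String> command, String input)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout so one reader drains both

        Process process = pb.start();

        // Write stdin on a separate thread so a large input cannot deadlock
        // against a child that is blocked writing its own output.
        Thread writer = new Thread(() -> {
            try (OutputStream stdin = process.getOutputStream()) {
                stdin.write(input.getBytes(StandardCharsets.UTF_8));
            } catch (IOException ignored) {
                // The child may exit without reading all of its input.
            }
        });
        writer.start();

        String output;
        try (InputStream stdout = process.getInputStream()) {
            output = new String(stdout.readAllBytes(), StandardCharsets.UTF_8);
        }

        writer.join();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException(
                "Command " + command + " exited with code " + exitCode + ": " + output);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        // "tr" is only a placeholder for the real tool on the worker node.
        System.out.print(run(List.of("tr", "a-z", "A-Z"), "hello from a mapper\n"));
    }
}
```

Starting a process per record can be expensive, so for large inputs it may be worth batching records or keeping one long-lived child process per task instead.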
This is certainly doable. You may find it best to work with Hadoop Streaming, which lets you create and run MapReduce jobs with any executable or script as the mapper and/or reducer.
I tend to start with external code inside Hadoop Streaming. Depending on your language, there are likely many good examples of how to use it in Streaming; once you are inside your language of choice, you can usually pipe data out to another program if desired. I have had several layers of programs in different languages playing nicely together, with no more effort than running them on a normal Linux box, beyond getting the outer layer working with Hadoop Streaming.
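To make the Streaming contract concrete, here is a minimal sketch of an "outer layer" program: it reads input lines from stdin and writes tab-separated key/value lines to stdout, which is how Streaming exchanges records with a mapper by default. The class name `StreamingWordMapper` and the word-count logic are illustrative assumptions only. Any executable with the same stdin/stdout behavior would work, and it could in turn pipe its data to another installed program.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical Streaming-style mapper: one input line in, zero or more
// "key<TAB>value" lines out.
final class StreamingWordMapper {

    static void map(BufferedReader in, PrintWriter out) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    // Everything before the first tab is treated as the key.
                    out.print(token);
                    out.print("\t1\n");
                }
            }
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
            new InputStreamReader(System.in, StandardCharsets.UTF_8));
        PrintWriter out = new PrintWriter(
            new OutputStreamWriter(System.out, StandardCharsets.UTF_8));
        map(in, out);
    }
}
```

Because the program only touches stdin and stdout, it can be tested on a normal Linux box by piping a file into it before it is handed to the Streaming jar through the `-mapper` option.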