Running Hadoop MapReduce: is it possible to call external executables outside of HDFS?


Within my mapper I'd like to call external software installed on the worker node, outside of HDFS. Is this possible? What is the best way to do this?

I understand that this may take away some of the advantages/scalability of MapReduce, but I'd like to interact with HDFS and also call compiled/installed external software from within my mapper to process some data.

Comments (2)

望她远 2024-12-09 21:46:28

Mappers (and reducers) are like any other process on the box: as long as the TaskTracker user has permission to run the executable, there is no problem doing so. There are a few ways to call external processes, but since we are already in Java, ProcessBuilder seems a logical place to start.
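A minimal sketch of that approach (not from the original answer), assuming the newer MapReduce API (org.apache.hadoop.mapreduce); the binary path /usr/local/bin/my_tool is a hypothetical placeholder for whatever is installed on the worker nodes:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExternalToolMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical external tool; pass the input record as an argument.
        ProcessBuilder pb = new ProcessBuilder("/usr/local/bin/my_tool", value.toString());
        pb.redirectErrorStream(true); // merge stderr into stdout for simplicity
        Process process = pb.start();

        // Read whatever the tool prints and emit each line as an output record.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                context.write(value, new Text(line));
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("External tool exited with code " + exitCode);
        }
    }
}
```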

EDIT: Just found that Hadoop has a class explicitly for this purpose: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Shell.html
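For comparison, a sketch using that Shell utility, assuming its ShellCommandExecutor inner class is on the classpath (it ships with hadoop-common); the command is again a hypothetical placeholder:

```java
import java.io.IOException;

import org.apache.hadoop.util.Shell.ShellCommandExecutor;

public class ShellExample {
    public static void main(String[] args) throws IOException {
        // Build the command as an argument array, then run it and capture stdout.
        ShellCommandExecutor executor = new ShellCommandExecutor(
                new String[] {"/usr/local/bin/my_tool", "--version"});
        executor.execute();
        System.out.println("exit code: " + executor.getExitCode());
        System.out.println("output: " + executor.getOutput());
    }
}
```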

臻嫒无言 2024-12-09 21:46:28

This is certainly doable. You may find it best to work with Hadoop Streaming. As the Hadoop Streaming documentation says:

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer.

I tend to start by wrapping the external code in Hadoop Streaming. Depending on your language, there are likely many good examples of how to use it with Streaming; once you are inside your language of choice, you can usually pipe data out to another program if desired. I have had several layers of programs in different languages play nicely together with no more effort than running them on a normal Linux box, beyond getting the outer layer working with Hadoop Streaming.
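For illustration, a hypothetical streaming invocation, assuming the external tool reads records on stdin and writes results on stdout; the streaming jar location and the HDFS input/output paths vary by installation and are placeholders here:

```
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input  /user/me/input \
    -output /user/me/output \
    -mapper /usr/local/bin/my_tool \
    -reducer /bin/cat
```

Here /bin/cat acts as an identity reducer, simply passing the mapper output through unchanged.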
