Running Nutch on an existing Hadoop cluster

We have a Hadoop cluster (Hadoop 0.20) and I want to use Nutch 1.2 to import some files into HDFS over HTTP, but I can't get Nutch to run on the cluster.

I've updated the $HADOOP_HOME/bin/hadoop script to add the Nutch jars to the classpath (actually I copied the classpath setup from the $NUTCH_HOME/bin/nutch script, minus the part that adds $NUTCH_HOME/lib/* to the classpath) and then tried running the following command to inject URLs:

hadoop jar nutch*.jar org.apache.nutch.crawl.Injector -conf conf/nutch-site.xml crawl_path urls_path

but I got java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.

The $NUTCH_HOME/conf/nutch-site.xml configuration file sets the property

<property>
    <name>mapreduce.job.jar.unpack.pattern</name>
    <value>(?:classes/|lib/|plugins/).*</value>
</property>

as a workaround to force unpacking of the plugins/ directory, as suggested elsewhere: when Nutch runs on Hadoop > 0.20.2 (or CDH) it will not find its plugins, because MapReduce does not unpack the plugins/ directory from the job jar (due to MAPREDUCE-967). For me, however, this workaround did not help.
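
One quick sanity check before fighting the unpack pattern (a sketch, not part of the original question; the job-file path below is a placeholder) is to confirm that the plugins/ directory is actually present inside the job file being submitted:

jar tf $NUTCH_HOME/nutch-1.2.job | grep '^plugins/' | head   # path and name of the .job file are assumptions

If that prints nothing, the job file was built without the plugins and no unpack pattern will help; it would need to be rebuilt from the Nutch source first.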

Has anybody encountered this problem? Do you have a step-by-step tutorial on how to run Nutch on an existing Hadoop cluster?

Thanks in advance,

mihaela


Answers (2)

巾帼英雄 2024-10-29 01:21:04

In the end I ran the Nutch MapReduce jobs (Injector, Generator and Fetcher) using the stock bin/hadoop script, with no modifications to Nutch itself.

The problem is with the org.apache.hadoop.util.RunJar class (the class that runs a Hadoop job jar when you call hadoop jar <jobfile> jobClass): from the job jar it adds only the classes/ and lib/ subdirectories to the classpath, while a Nutch job jar also has a plugins/ subfolder that contains the plugins used at runtime. I tried overriding the property mapreduce.job.jar.unpack.pattern with the value (?:classes/|lib/|plugins/).* so that RunJar would put the plugins on the classpath as well, but it didn't work.
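
For reference, such an override can also be supplied per job through Hadoop's generic options instead of nutch-site.xml (a sketch only; the crawldb and URL directory paths are placeholders, and the point above is that the override did not help either way):

hadoop jar <path to nutch job file> org.apache.nutch.crawl.Injector \
    -Dmapreduce.job.jar.unpack.pattern='(?:classes/|lib/|plugins/).*' \
    crawl/crawldb urls    # crawldb and URL dir are placeholders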

After looking at the Nutch code I saw that it uses a property, plugin.folders, which controls where the plugins are looked up. What I did, and what worked, was to copy the plugins subfolder from the job jar to a shared drive and set plugin.folders to that path each time I run a Nutch job. For example:

 hadoop jar <path to nutch job file> org.apache.nutch.fetcher.Fetcher -conf ../conf/nutch-default.xml -Dplugin.folders=<path to plugins folder> <segment path>
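
The copy step itself can be done with something like the following (the shared path and job-file name are placeholders, not from the answer):

unzip /path/to/nutch-1.2.job 'plugins/*' -d /mnt/shared/nutch    # the .job file is an ordinary zip archive; paths are placeholders
# afterwards pass -Dplugin.folders=/mnt/shared/nutch/plugins to every Nutch job, as above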

In the conf/nutch-default.xml file I have set some properties like the agent name, proxy host and port, timeout, content limit, etc.
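
For illustration, such entries typically look like this in Nutch 1.x (placeholder values only, not the answer's exact settings):

<!-- placeholder values; adjust for your own crawl -->
<property>
    <name>http.agent.name</name>
    <value>my-crawler</value>
</property>
<property>
    <name>http.proxy.host</name>
    <value>proxy.example.com</value>
</property>
<property>
    <name>http.proxy.port</name>
    <value>8080</value>
</property>
<property>
    <name>http.timeout</name>
    <value>10000</value>
</property>
<property>
    <name>http.content.limit</name>
    <value>-1</value>
</property>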

I also tried building the Nutch job jar with the plugins subfolder nested inside the lib subfolder and then setting the plugin.folders property to lib/plugins, but that didn't work either.

饭团 2024-10-29 01:21:04

I ran Nutch on an existing Hadoop cluster by modifying the bin/nutch script and then copying the Nutch config files into the Hadoop folders, adjusting the TS and NS parameters. Did you try it that way?
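
Roughly, that approach would look like the following (a sketch under assumed paths; the exact edits to bin/nutch and what the TS and NS parameters refer to are not spelled out here):

# copy the Nutch configuration where the cluster's Hadoop will pick it up (paths are assumptions)
cp $NUTCH_HOME/conf/nutch-site.xml $HADOOP_HOME/conf/
# then run the jobs through the modified bin/nutch so it hands them to the cluster's hadoop command
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls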
