Running Nutch on an existing Hadoop cluster

We have a Hadoop cluster (Hadoop 0.20) and I want to use Nutch 1.2 to import some files into HDFS over HTTP, but I can't get Nutch to run on the cluster.

I've updated the $HADOOP_HOME/bin/hadoop script to add the Nutch jars to the classpath (actually I copied the classpath setup from the $NUTCH_HOME/bin/nutch script, minus the part that adds $NUTCH_HOME/lib/* to the classpath) and then tried running the following command to inject URLs:

hadoop jar nutch*.jar org.apache.nutch.crawl.Injector -conf conf/nutch-site.xml crawl_path urls_path

but I got java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.

The $NUTCH_HOME/conf/nutch-site.xml configuration file sets the property

<property>
    <name>mapreduce.job.jar.unpack.pattern</name>
    <value>(?:classes/|lib/|plugins/).*</value>
</property>

as a workaround to force unpacking of the plugins/ directory, as suggested elsewhere: when Nutch runs on Hadoop > 0.20.2 (or CDH) it will not find its plugins, because MapReduce does not unpack the plugins/ directory from the job jar (due to MAPREDUCE-967). For me, however, this workaround did not help.
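
One quick sanity check before fighting the unpack pattern (a sketch, not part of the original question; the job-file path below is a placeholder) is to confirm that the plugins/ directory is actually present inside the job file being submitted:

jar tf $NUTCH_HOME/nutch-1.2.job | grep '^plugins/' | head   # path and name of the .job file are assumptions

If that prints nothing, the job file was built without the plugins and no unpack pattern will help; it would need to be rebuilt from the Nutch source first.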

Has anybody encountered this problem? Do you have a step-by-step tutorial on how to run Nutch on an existing Hadoop cluster?

Thanks in advance,

mihaela


Answers (2)

巾帼英雄 2024-10-29 01:21:04

In the end I ran the Nutch MapReduce jobs (Injector, Generator and Fetcher) using the stock bin/hadoop script, with no modifications to Nutch itself.

The problem is with the org.apache.hadoop.util.RunJar class (the class that runs a Hadoop job jar when you call hadoop jar <jobfile> jobClass): from the job jar it adds only the classes/ and lib/ subdirectories to the classpath, while a Nutch job jar also has a plugins/ subfolder that contains the plugins used at runtime. I tried overriding the property mapreduce.job.jar.unpack.pattern with the value (?:classes/|lib/|plugins/).* so that RunJar would put the plugins on the classpath as well, but it didn't work.
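
For reference, such an override can also be supplied per job through Hadoop's generic options instead of nutch-site.xml (a sketch only; the crawldb and URL directory paths are placeholders, and the point above is that the override did not help either way):

hadoop jar <path to nutch job file> org.apache.nutch.crawl.Injector \
    -Dmapreduce.job.jar.unpack.pattern='(?:classes/|lib/|plugins/).*' \
    crawl/crawldb urls    # crawldb and URL dir are placeholders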

After looking at the Nutch code I saw that it uses a property, plugin.folders, which controls where the plugins are looked up. What I did, and what worked, was to copy the plugins subfolder from the job jar to a shared drive and set plugin.folders to that path each time I run a Nutch job. For example:

 hadoop jar <path to nutch job file> org.apache.nutch.fetcher.Fetcher -conf ../conf/nutch-default.xml -Dplugin.folders=<path to plugins folder> <segment path>
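
The copy step itself can be done with something like the following (the shared path and job-file name are placeholders, not from the answer):

unzip /path/to/nutch-1.2.job 'plugins/*' -d /mnt/shared/nutch    # the .job file is an ordinary zip archive; paths are placeholders
# afterwards pass -Dplugin.folders=/mnt/shared/nutch/plugins to every Nutch job, as above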

In the conf/nutch-default.xml file I have set some properties like the agent name, proxy host and port, timeout, content limit, etc.
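
For illustration, such entries typically look like this in Nutch 1.x (placeholder values only, not the answer's exact settings):

<!-- placeholder values; adjust for your own crawl -->
<property>
    <name>http.agent.name</name>
    <value>my-crawler</value>
</property>
<property>
    <name>http.proxy.host</name>
    <value>proxy.example.com</value>
</property>
<property>
    <name>http.proxy.port</name>
    <value>8080</value>
</property>
<property>
    <name>http.timeout</name>
    <value>10000</value>
</property>
<property>
    <name>http.content.limit</name>
    <value>-1</value>
</property>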

I also tried building the Nutch job jar with the plugins subfolder nested inside the lib subfolder and then setting the plugin.folders property to lib/plugins, but that didn't work either.

饭团 2024-10-29 01:21:04

I ran Nutch on an existing Hadoop cluster by modifying the bin/nutch script and then copying the Nutch config files into the Hadoop folders, adjusting the TS and NS parameters. Did you try it that way?
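
Roughly, that approach would look like the following (a sketch under assumed paths; the exact edits to bin/nutch and what the TS and NS parameters refer to are not spelled out here):

# copy the Nutch configuration where the cluster's Hadoop will pick it up (paths are assumptions)
cp $NUTCH_HOME/conf/nutch-site.xml $HADOOP_HOME/conf/
# then run the jobs through the modified bin/nutch so it hands them to the cluster's hadoop command
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls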
