Running Nutch on an existing Hadoop cluster
We have a Hadoop cluster (Hadoop 0.20) and I want to use Nutch 1.2 to import some files over HTTP into HDFS, but I couldn't get Nutch running on the cluster.
I've updated the $HADOOP_HOME/bin/hadoop script to add the Nutch jars to the classpath (actually I copied the classpath setup from the $NUTCH_HOME/bin/nutch script, minus the part that adds $NUTCH_HOME/lib/* to the classpath) and then I tried running the following command to inject URLs:
hadoop jar nutch*.jar org.apache.nutch.crawl.Injector -conf conf/nutch-site.xml crawl_path urls_path
but I got: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
The $NUTCH_HOME/conf/nutch-site.xml configuration file sets the property
<property>
<name>mapreduce.job.jar.unpack.pattern</name>
<value>(?:classes/|lib/|plugins/).*</value>
</property>
as a workaround to force unpacking of the plugins/ directory, as suggested in "When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins because MapReduce will not unpack plugin/ directory from the job's pack (due to MAPREDUCE-967)", but it seems that for me it didn't work.
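The unpack pattern is an ordinary regular expression matched against entry paths inside the job jar (after MAPREDUCE-967 the Hadoop default is reported as (?:classes/|lib/).*, which is why plugins/ never gets extracted). A quick sketch of the difference between the default and the overridden pattern, written in Python, whose regex syntax agrees with Java's for these patterns:

```python
import re

# Hadoop's default unpack pattern (per MAPREDUCE-967): only classes/ and lib/
default_pattern = re.compile(r"(?:classes/|lib/).*")
# The overridden pattern from nutch-site.xml: also unpack plugins/
override_pattern = re.compile(r"(?:classes/|lib/|plugins/).*")

entries = [
    "classes/org/apache/nutch/crawl/Injector.class",
    "lib/hadoop-core.jar",
    "plugins/urlnormalizer-basic/plugin.xml",
]

# Show which job-jar entries each pattern would unpack
for entry in entries:
    print(entry,
          "default:", bool(default_pattern.match(entry)),
          "override:", bool(override_pattern.match(entry)))
```

With the default pattern the plugins/ entries are never unpacked, so the URLNormalizer extension point cannot be resolved at runtime.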
Has anybody encountered this problem? Do you have a step-by-step tutorial on how to run Nutch on an existing Hadoop cluster?
Thanks in advance,
mihaela
2 Answers
Finally, I ran the Nutch MapReduce jobs (Injector, Generator and Fetcher) using the bin/hadoop script, with no modifications to Nutch.
The problem is with the org.apache.hadoop.util.RunJar class (the class that runs a Hadoop job jar when you call hadoop jar <jobfile> jobClass): from the job jar it adds only the classes/ and lib/ subdirectories to the classpath, while Nutch job jars also have a plugins subfolder that contains the plugins used at runtime. I tried overriding the property mapreduce.job.jar.unpack.pattern with the value (?:classes/|lib/|plugins/).* so that the RunJar class would add the plugins to the classpath as well, but it didn't work.
After looking at the Nutch code I saw that it uses a property plugin.folders to control where the plugins can be found. So what I did, and what worked, was to copy the plugins subfolder from the job jar to a shared drive and set the plugin.folders property to that path each time I run a Nutch job.
In the conf/nutch-default.xml file I set some properties such as the agent name, proxy host and port, timeout, content limit, etc.
I also tried building the Nutch job jar with the plugins subfolder inside the lib subfolder and then setting the plugin.folders property to lib/plugins, but that didn't work either.
I ran Nutch on an existing Hadoop cluster by modifying the bin/nutch script and then copying the Nutch config files into the Hadoop folders, modifying the TS and NS parameters. Did you try it that way?