Unable to run the fetcher job in Nutch deploy mode



I've successfully run Nutch (v1.4) for a crawl using local mode on my Ubuntu 11.10 system. However, when switching over to "deploy" mode (all else being the same), I get an error during the fetch cycle.

I have Hadoop running successfully on the machine in pseudo-distributed mode (the replication factor is 1, and just one map and one reduce task are configured). "jps" shows that all Hadoop daemons are up and running.
18920 Jps
14799 DataNode
15127 JobTracker
14554 NameNode
15361 TaskTracker
15044 SecondaryNameNode
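
For reference, the pseudo-distributed settings described above usually amount to just a handful of properties. A minimal sketch, assuming the Hadoop 0.20-era config file names (each snippet goes inside that file's <configuration> element); the localhost ports are the conventional ones, not values copied from my actual files:

  <!-- conf/core-site.xml -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>

  <!-- conf/hdfs-site.xml: keep a single copy of every block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <!-- conf/mapred-site.xml: single-node JobTracker, one map and one reduce task -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
  </property>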

I have also added the HADOOP_HOME/bin path to my PATH variable.

PATH=$PATH:/home/jimb/hadoop/bin

Then I ran the crawl from the nutch/deploy directory, as below:

bin/nutch crawl /data/runs/ar/seedurls -dir /data/runs/ar/crawls
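
For completeness, the same command with the usual options of the Nutch 1.x crawl command spelled out explicitly; the -topN value below is only illustrative, while depth 5 and 10 threads are the defaults the log shows:

  bin/nutch crawl /data/runs/ar/seedurls -dir /data/runs/ar/crawls -depth 5 -threads 10 -topN 1000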

Here is the output I get:

  12/01/25 13:55:49 INFO crawl.Crawl: crawl started in: /data/runs/ar/crawls
  12/01/25 13:55:49 INFO crawl.Crawl: rootUrlDir = /data/runs/ar/seedurls
  12/01/25 13:55:49 INFO crawl.Crawl: threads = 10
  12/01/25 13:55:49 INFO crawl.Crawl: depth = 5
  12/01/25 13:55:49 INFO crawl.Crawl: solrUrl=null
  12/01/25 13:55:49 INFO crawl.Injector: Injector: starting at 2012-01-25 13:55:49
  12/01/25 13:55:49 INFO crawl.Injector: Injector: crawlDb: /data/runs/ar/crawls/crawldb
  12/01/25 13:55:49 INFO crawl.Injector: Injector: urlDir: /data/runs/ar/seedurls
  12/01/25 13:55:49 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
  12/01/25 13:56:53 INFO mapred.FileInputFormat: Total input paths to process : 1
...
...
  12/01/25 13:57:21 INFO crawl.Injector: Injector: Merging injected urls into crawl db.
...
  12/01/25 13:57:48 INFO crawl.Injector: Injector: finished at 2012-01-25 13:57:48, elapsed: 00:01:59
  12/01/25 13:57:48 INFO crawl.Generator: Generator: starting at 2012-01-25 13:57:48
  12/01/25 13:57:48 INFO crawl.Generator: Generator: Selecting best-scoring urls due for fetch.
  12/01/25 13:57:48 INFO crawl.Generator: Generator: filtering: true
  12/01/25 13:57:48 INFO crawl.Generator: Generator: normalizing: true
  12/01/25 13:57:48 INFO mapred.FileInputFormat: Total input paths to process : 2
...
  12/01/25 13:58:15 INFO crawl.Generator: Generator: Partitioning selected urls for politeness.
  12/01/25 13:58:16 INFO crawl.Generator: Generator: segment: /data/runs/ar/crawls/segments/20120125135816
...
  12/01/25 13:58:42 INFO crawl.Generator: Generator: finished at 2012-01-25 13:58:42, elapsed: 00:00:54
  12/01/25 13:58:42 ERROR fetcher.Fetcher: Fetcher: No agents listed in 'http.agent.name' property.

Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
        at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Now, the configuration files for the "local" mode are set up fine (since a crawl in local mode succeeded). For running in deploy mode, since the "deploy" folder did not have any "conf" subdirectory, I assumed that either:
a) the conf files need to be copied over under "deploy/conf", OR
b) the conf files need to be placed onto HDFS.

I have verified that option (a) above does not help. So I'm assuming that the Nutch configuration files need to exist on HDFS for the fetcher to run successfully in deploy mode? However, I don't know at what path within HDFS I should place these Nutch conf files, or perhaps I'm barking up the wrong tree?
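
One check I can think of, assuming I understand the deploy layout correctly: the conf files may simply be packed into the .job archive that ant builds under runtime/deploy, in which case the settings the deploy run actually sees could be inspected like this (the archive name below is the stock 1.4 one):

  # list the config files baked into the job archive
  cd runtime/deploy
  jar tf apache-nutch-1.4.job | grep -E 'nutch-(default|site)\.xml'

  # print the packaged http.agent.name setting, if any
  unzip -p apache-nutch-1.4.job nutch-site.xml | grep -A 1 'http.agent.name'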

If Nutch, in "deploy" mode, reads its configuration from the files under "local/conf", then why did the local crawl work fine while the deploy-mode crawl did not?

What am I missing here?

Thanks in advance!


Comments (2)

遥远的绿洲 2025-01-04 21:05:42


Try this out:

  1. In the Nutch source directory, modify conf/nutch-site.xml to set http.agent.name properly (a sketch of the file follows this list).

  2. Re-build the code using ant.

  3. Go to the runtime/deploy directory, set the required environment variables, and try crawling again.
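
A minimal conf/nutch-site.xml sketch for step 1 (the agent name value is only a placeholder; pick one that identifies your crawler):

  <?xml version="1.0"?>
  <configuration>
    <!-- The fetcher refuses to start unless this property is non-empty -->
    <property>
      <name>http.agent.name</name>
      <!-- placeholder value -->
      <value>MyNutchCrawler</value>
    </property>
  </configuration>

The rebuild in step 2 matters because in deploy mode the configuration is read from the .job archive that ant packs, not from the conf directory on disk, so editing the file without re-running ant has no effect on the Hadoop job.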

谈场末日恋爱 2025-01-04 21:05:42


This is likely because you have not rebuilt yet. Can you run "ant" and see what happens? Obviously, you need to update the http.agent.name in nutch-site.xml if you have not done so yet.
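
In case it helps, a sketch of the full rebuild-and-retry sequence, assuming a Nutch 1.4 source checkout and re-using the paths from the question ($NUTCH_SRC is just a placeholder):

  # 1. after setting http.agent.name in conf/nutch-site.xml, rebuild so the
  #    change gets packed into the deploy job archive
  cd $NUTCH_SRC                               # placeholder: Nutch 1.4 source tree
  ant

  # 2. re-run the crawl from the freshly built deploy runtime
  cd runtime/deploy
  export PATH=$PATH:/home/jimb/hadoop/bin     # Hadoop bin path from the question
  bin/nutch crawl /data/runs/ar/seedurls -dir /data/runs/ar/crawls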
