Nutch crawl error - Input path does not exist

Posted on 2024-12-03 22:00:02

I have a Nutch/Hadoop setup with two datanode servers. I tried to crawl some URLs, but Nutch fails with this error:

Fetcher: segment: crawl/segments
Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://devcluster01:9000/user/nutch/crawl/segments/crawl_generate
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:105)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1145)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1116)

Can someone help me? I don't know how to solve this.
Many thanks!

Comments (2)

尹雨沫 2024-12-10 22:00:02

Verify whether the path nutch/crawl/segments/crawl_generate is correct.

Either the path is wrong or the parse phase has not completed.
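
One way to do the check suggested above is to ask HDFS directly whether the path from the error message exists. The sketch below wraps the `hadoop fs -test -e` command, which exits with status 0 when the path exists. The function name `hdfs_path_exists` and the injectable `run` parameter are hypothetical, introduced only for illustration, and the sketch assumes the `hadoop` command is on the PATH of the machine where it runs.

```python
import subprocess


def hdfs_path_exists(path, run=subprocess.call):
    """Return True if `hadoop fs -test -e <path>` exits with status 0.

    `run` is a hypothetical hook (defaults to subprocess.call) so the
    function can be exercised without a Hadoop installation.
    """
    # `-test -e` reports existence through the exit code: 0 means the path exists.
    return run(["hadoop", "fs", "-test", "-e", path]) == 0


# Example call on a cluster node (path taken from the error message):
#   hdfs_path_exists("crawl/segments/crawl_generate")
```

If this returns False for the path in the stack trace, the fetcher is looking in a place where no fetch list has been written.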

万人眼中万个我 2024-12-10 22:00:02

The generate phase of Nutch creates "crawl_generate" inside the segments directory. It contains the fetch list used by the fetch phase. The error you got means the fetch phase cannot find that fetch list. Make sure the output of generate is written to the location where fetch is looking for it.
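
In the log from the question, the fetcher reports its segment as crawl/segments and then looks for crawl/segments/crawl_generate. In Nutch 1.x the generate step normally writes each segment into a timestamp-named subdirectory (yyyyMMddHHmmss) under the segments directory, so one likely cause, stated here as an assumption rather than a confirmed diagnosis, is that the parent directory was passed to fetch instead of a single segment. The sketch below picks the newest segment from a directory listing and builds the path where the fetch list would be expected. The helper names `latest_segment` and `fetch_list_path` are hypothetical, and the listing is assumed to come from something like `hadoop fs -ls crawl/segments`.

```python
import re

# Assumption: segment directories are named by a 14-digit timestamp (yyyyMMddHHmmss).
SEGMENT_NAME = re.compile(r"\d{14}")


def latest_segment(segments_dir, entries):
    """Return the path of the newest timestamp-named segment, or None if there is none.

    `entries` is a list of directory names found directly under `segments_dir`.
    """
    # Fixed-width timestamps sort chronologically when sorted as strings.
    names = sorted(e for e in entries if SEGMENT_NAME.fullmatch(e))
    if not names:
        return None
    return segments_dir.rstrip("/") + "/" + names[-1]


def fetch_list_path(segment):
    """Return the location where the fetch phase expects the fetch list."""
    return segment.rstrip("/") + "/crawl_generate"
```

The value returned by `latest_segment` is what would be passed to the fetch step as the segment argument. If it returns None, generate has not produced any segment under that directory yet.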
