Nutch 抓取错误 - 输入路径不存在
我有 2 个 datanode 服务器的 nutch/hadoop 设置。我尝试抓取一些网址,但 nutch 失败并出现以下错误:
Fetcher: segment: crawl/segments
Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://devcluster01:9000/user/nutch/crawl/segments/crawl_generate
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:105)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1116)
有人可以帮助我吗?我不知道如何解决这个问题! 很多很多谢谢!
I have nutch/hadoop setup with 2 datanode server. I tried to crawl some urls but nutch fails with this error:
Fetcher: segment: crawl/segments
Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://devcluster01:9000/user/nutch/crawl/segments/crawl_generate
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:105)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1116)
Can someone help me? I don't know how to solve this!
Many many Thx!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
验证nutch/crawl/segments/crawl_generate路径是否正确。
路径错误或解析阶段未完成。
verify whether nutch/crawl/segments/crawl_generate path is correct.
Either path is wrong or parse phase is not completed.
nutch的生成阶段在segments目录中创建“crawl_generate”。这包含在获取阶段使用的获取列表。您收到的错误是因为获取阶段无法获取获取列表。确保generate 的输出填充在fetch 试图找到它的位置。
The generate phase of nutch creates "crawl_generate" inside the segments directory. This contains the fetch list used in the fetch phase. The error that you got is because the fetch phase is unable to get the fetch list. Ensure that the output of generate is populated at the location where fetch is trying to find it.