Nutch path error

Posted 2024-12-12 05:39:52

Hi, I have installed Solr and Nutch on Ubuntu. I am able to crawl and index on occasion, but not all the time. I have been getting this path error repeatedly and could not find a solution online. Usually I delete the directories that have errors and rerun, and then it runs fine. But I don't want to do this anymore. What is causing the error? Thanks.

LinkDb: adding segment: file:/home/nutch/nutch/runtime/local/crawl/segments/20111027231916
LinkDb: adding segment: file:/home/nutch/nutch/runtime/local/crawl/segments/20111027232907
LinkDb: adding segment: file:/home/nutch/nutch/runtime/local/crawl/segments/20111027233840
LinkDb: adding segment: file:/home/nutch/nutch/runtime/local/crawl/segments/20111027224701
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/nutch/runtime/local/crawl/segments/20111027231916/parse_data
Input path does not exist: file:/home/nutch/nutch/runtime/local/crawl/segments/20111027232907/parse_data
Input path does not exist: file:/home/nutch/nutch/runtime/local/crawl/segments/20111027233840/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
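
The log shows that the fourth segment (20111027224701) was parsed, while the other three are missing their parse_data directories, meaning those segments were fetched but never parsed. As a diagnostic, the loop below lists the segments LinkDb would reject; this is a minimal sketch, assuming the local-filesystem layout from the log above:

for seg in /home/nutch/nutch/runtime/local/crawl/segments/*; do
    # parse_data is written by the parse step; if it is absent, the segment
    # was fetched but never parsed (or the parse was interrupted)
    if [ ! -d "$seg/parse_data" ]; then
        echo "segment missing parse_data: $seg"
    fi
done

A segment in that state can often be completed with "bin/nutch parse <segment_dir>" rather than deleted, as long as its fetched content is intact.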

Comments (2)

临风闻羌笛 2024-12-19 05:39:52

You must have killed a Nutch process. Just clear the directories (crawldb, etc.) and you're good to go.

Nutch first looks for an existing link database (linkdb) in the crawl path; if it can't find one, it creates a new one from the seed file you provide. If you kill a crawling process, subsequent reads from the link database fail.
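
If you would rather not clear everything, one option in the spirit of this answer is to delete only the segments that the killed process left half-finished, keeping the completed ones. A minimal sketch in plain shell, assuming the crawl/ layout from the question:

for seg in crawl/segments/*; do
    # a fully processed segment contains parse_data; a segment
    # interrupted mid-crawl will be missing it
    if [ ! -d "$seg/parse_data" ]; then
        echo "removing incomplete segment: $seg"
        rm -r "$seg"
    fi
done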

在风中等你 2024-12-19 05:39:52
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

Check that the crawl folder exists and has the proper permissions, and use -linkdb as above (in newer versions it is optional). Most of the time this error comes from the paths you specify for crawldb, linkdb, and segments not being given correctly.

I had the same problem; I used the syntax above and it worked. Just check that the folders you are specifying are correct.
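
As a concrete way to verify those paths before indexing, list the directories the command expects and rebuild the linkdb with invertlinks first. A minimal sketch, assuming the same crawl/ layout as the command above (adjust the Solr URL to your setup):

# confirm the crawldb, linkdb, and segment directories exist and are readable
ls -ld crawl/crawldb crawl/linkdb crawl/segments/*

# rebuild the link database from all segments, then run the indexing step
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*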

Use this:

http://thetechietutorials.blogspot.com/2011/06/solr-and-nutch-integration.html

It worked for me.
