Nutch solrindex command not indexing all URLs in Solr
I have a Nutch index crawled from a specific domain and I am using the solrindex command to push the crawled data to my Solr index. The problem is that only some of the crawled URLs seem to actually get indexed in Solr. I had the Nutch crawl write its output to a text file so I could see the URLs it crawled, but when I search Solr for some of those URLs I get no results.
Command I am using to do the Nutch crawl: bin/nutch crawl urls -dir crawl -depth 20 -topN 2000000
This command is completing successfully and the output displays URLs that I cannot find in the resulting Solr index.
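In case it helps, here is roughly how I've been checking what each segment actually fetched and parsed (Nutch 1.x; the readseg option names might differ slightly between versions):

# List every segment along with its FETCHED and PARSED counts
bin/nutch readseg -list -dir crawl/segments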
Command I am using to push the crawled data to Solr: bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
The output for this command says it is also completing successfully, so it does not seem to be an issue with the process terminating prematurely (which is what I initially thought it might be).
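To rule out me simply miscounting, I've also been comparing the crawldb totals against what Solr reports. A rough sketch of that check, assuming the same paths and Solr URL as in the commands above (adjust the select URL if you run multiple cores):

# Break down the crawldb by status (db_fetched, db_unfetched, db_gone, ...)
bin/nutch readdb crawl/crawldb -stats

# Ask Solr how many documents it actually holds (see numFound in the response)
curl "http://localhost:8983/solr/select?q=*:*&rows=0"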
One final thing I find strange is that the entire Nutch & Solr config is identical to a setup I used previously on a different server, where I had no problems. They are literally the same config files copied onto this new server.
TL;DR: I have a set of URLs successfully crawled in Nutch, but when I run the solrindex command only some of them are pushed to Solr. Please help.
UPDATE: I've re-run all these commands and the output still insists it's all working fine. I've looked into any blockers for indexing that I can think of, but still no luck. The URLs being passed to Solr are all active and publicly accessible, so that's not an issue. I'm really banging my head against a wall here so would love some help.
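For what it's worth, this is the kind of spot check I've been running on individual missing URLs, to see whether they ever made it into the crawldb at all and with what status (the URL below is just a placeholder):

# Print the CrawlDatum (status, fetch time, signature, ...) for a single URL
bin/nutch readdb crawl/crawldb -url http://example.com/some/missing/page.html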
1 Answer
I can only guess at what happened, based on my experience:
There is a component called url-normalizer (with its configuration file url-normalizer.xml) which truncates some URLs (removing URL parameters, session IDs, ...).
Additionally, Nutch enforces a uniqueness constraint: by default, each URL is only saved once.
So, if the normalizer truncates two or more URLs ('foo.jsp?param=value', 'foo.jsp?param=value2', 'foo.jsp?param=value3', ...) to exactly the same URL ('foo.jsp'), they only get saved once, and Solr will only ever see a subset of all your crawled URLs.
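If you want to see this in action, you can pipe a few of the "missing" URLs through the normalizer checker and compare what comes out (treat this as a sketch, since the exact class/command name can differ between Nutch versions; the default regex rules live in conf/regex-normalize.xml):

# Feed two URLs that differ only in their query string through the configured normalizers;
# if both come back as the same URL, they end up stored as a single document
printf 'http://example.com/foo.jsp?param=value\nhttp://example.com/foo.jsp?param=value2\n' | bin/nutch org.apache.nutch.net.URLNormalizerChecker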
cheers