Adding URL parameters to the Nutch/Solr index and search results
I can't find any hint on how to set up Nutch so that it does NOT filter or remove my URL parameters. I want to crawl and index some pages where lots of content is hidden behind the same base URL (like /news.jsp?id=1, /news.jsp?id=2, /news.jsp?id=3 and so on).
- regex-normalize.xml only removes redundant parts of the URL (like session IDs and a trailing ?)
- regex-urlfilter.txt seems to have a wildcard for my host (+^http://$myHost/)
The crawling works fine so far. Any ideas?
cheers,
mana
EDIT:
Part of the solution is hidden in this related question:
configuring nutch regex-normalize.xml
The following rule in regex-urlfilter.txt
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
has to be modified. All characters that may appear in a URL parameter, such as '?' and '=', have to be allowed. The new line looks like this:
-[*!@]
Pages with parameters are crawled now. But they are not yet sent to Solr with their parameters (Solr still cuts the parameters from the links).
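To illustrate why that one-line change matters, here is a minimal sketch of first-match-wins rule evaluation. It is a simplified model written for this post, not Nutch's actual filter code; the class name `UrlFilterSketch`, the host `www.example.com`, and the assumption that an unmatched URL is rejected are all mine.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Simplified model of regex-urlfilter.txt style rules (an assumption, not Nutch's code). */
final class UrlFilterSketch {

    /**
     * Each rule line starts with '+' (accept) or '-' (reject), followed by a regex.
     * The first rule whose regex is found anywhere in the URL decides the outcome;
     * a URL that matches no rule is rejected.
     */
    static boolean accepts(List<String> ruleLines, String url) {
        for (String line : ruleLines) {
            boolean accept = line.charAt(0) == '+';
            Pattern pattern = Pattern.compile(line.substring(1));
            if (pattern.matcher(url).find()) {
                return accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical host rule standing in for +^http://$myHost/
        String hostRule = "+^http://www\\.example\\.com/";
        List<String> original = List.of("-[?*!@=]", hostRule);
        List<String> modified = List.of("-[*!@]", hostRule);

        String url = "http://www.example.com/news.jsp?id=1";
        System.out.println("original rules accept: " + accepts(original, url));
        System.out.println("modified rules accept: " + accepts(modified, url));
    }
}
```

With the original rule the '?' and '=' in the URL hit the reject line before the host rule is ever consulted, which is why rule order and the character class both matter.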
EDIT2:
Nutch has some issues with how it handles relative URLs ('?param=value'). Still stuck on that parameter thing;
see the mailing list: http://search.lucidimagination.com/search/document/b6011a942b323ba3/problem_with_href_param_value_links
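For reference, under RFC 3986 a link that consists only of a query string keeps the path of the page it appears on and replaces just the query. A minimal sketch of that rule, assuming plain string handling is good enough for the URLs involved; `QueryLinkResolver` and the example URLs are hypothetical and this is not how Nutch itself resolves links:

```java
import java.net.URI;

/** Sketch of resolving href values against a base URL (hypothetical helper, not Nutch code). */
final class QueryLinkResolver {

    /**
     * Query-only links ("?id=2") keep the base path and replace the query,
     * as described in RFC 3986. Everything else is delegated to java.net.URI.
     */
    static String resolve(String base, String href) {
        if (!href.startsWith("?")) {
            return URI.create(base).resolve(href).toString();
        }
        // Cut the base at its fragment or query, whichever comes first.
        int cut = base.length();
        int hash = base.indexOf('#');
        if (hash >= 0) {
            cut = hash;
        }
        int query = base.indexOf('?');
        if (query >= 0 && query < cut) {
            cut = query;
        }
        return base.substring(0, cut) + href;
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/news.jsp?id=1";
        System.out.println(resolve(base, "?id=2"));
        System.out.println(resolve(base, "other.jsp"));
    }
}
```

Comparing the output of something like this with the URLs that actually end up in the crawl db may help narrow down whether the parameters are lost during link resolution or later, at indexing time.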
Comments (1)
You could create a custom field in a Nutch indexing filter to save the entire URL. As long as you define the same field in the Solr schema with stored="true", it will show up in your results. See WritingPluginExample-1.2.
Let me know if you'd like some help.