Adding URL parameters to the Nutch/Solr index and search results

Posted on 2024-11-17 12:20:29, 1138 characters, 9 views, 0 comments

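I can't find any hint on how to set up Nutch so that it does NOT filter or remove my URL parameters. I want to crawl and index pages where a lot of content sits behind the same base URL (for example /news.jsp?id=1, /news.jsp?id=2, /news.jsp?id=3 and so on).

regex-normalize.xml only removes redundant parts of the URL (such as session IDs and a trailing ?).
regex-urlfilter.txt seems to have a wildcard for my host (+^http://$myHost/).

Crawling works fine so far. Any ideas?

Cheers,
mana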

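EDIT:

Part of the solution is hidden in this question:

configuring nutch regex-normalize.xml

The following rule, which is found in the URL filter file (regex-urlfilter.txt) rather than in regex-normalize.xml,

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

has to be modified. It must allow every character that can appear in a URL parameter, such as '?' and '='. The new line looks like this:

-[*!@]

With this change, pages with parameters are crawled. But they are not yet sent to Solr with their parameters (Solr still cuts the parameters from the links).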
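
To see why the original rule rejects parameterized URLs, here is a minimal Python sketch of a first-match-wins filter list. It only approximates the behavior described in the comments of regex-urlfilter.txt (rules are tried in order, '+' accepts, '-' rejects, no match means rejected) and is not Nutch's own code. The host example.com and the function name url_accepted are made up for illustration.

```python
import re

def url_accepted(url, rules):
    """Apply the rules in order; the first pattern found in the URL decides.

    A leading '+' accepts the URL and a leading '-' rejects it.
    A URL that matches no rule at all is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

# Hypothetical rule lists: the original rule versus the relaxed one.
default_rules = [r"-[?*!@=]", r"+^http://example\.com/"]
relaxed_rules = [r"-[*!@]", r"+^http://example\.com/"]

url = "http://example.com/news.jsp?id=1"
print(url_accepted(url, default_rules))  # False: '?' hits the reject rule first
print(url_accepted(url, relaxed_rules))  # True: only the host rule matches
```

Because the reject rule comes before the host rule, any URL containing '?' or '=' is dropped before the host wildcard is ever consulted.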

EDIT2:

Nutch has some issues with how it handles relative URLs ('?param=value'). I am still stuck on the parameter problem.

See the mailing list: http://search.lucidimagination.com/search/document/b6011a942b323ba3/problem_with_href_param_value_links
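
For reference, RFC 3986 resolves a query-only link against the path of the page it appears on and replaces only the query string. The short Python sketch below shows the result one would expect for such a link. It uses the standard library's urljoin and does not reproduce what Nutch does; the example.com URL is an assumption.

```python
from urllib.parse import urljoin

# Hypothetical page that contains a link written as href="?id=2".
base = "http://example.com/news.jsp?id=1"
href = "?id=2"

# The path of the base page is kept and only the query is replaced.
resolved = urljoin(base, href)
print(resolved)  # http://example.com/news.jsp?id=2
```

If a crawler resolves the same link to a URL without news.jsp in it, the link was resolved against the directory instead of the page, which would explain missing parameter pages.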



旧话新听 2024-11-24 12:20:29

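You can create a custom field in a Nutch indexing filter to hold the whole URL. As long as you define the same field in the Solr schema with stored="true", it will show up in your results. See WritingPluginExample-1.2.

Let me know if you need help.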
