What analyzer should I use for URLs in Lucene.NET?
I'm having problems getting a simple URL to tokenize properly so that you can search it as expected.
I'm indexing "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm" with the StandardAnalyzer and it is tokenizing the string as the following (debug output):
(http,0,4,type=<ALPHANUM>)
(news.bbc.co.uk,7,21,type=<HOST>)
(sport1/hi,22,31,type=<NUM>)
(football,32,40,type=<ALPHANUM>)
(internationals/8196322.stm,41,67,type=<NUM>)
In general it looks good: http by itself, then the hostname. The issue seems to come with the forward slashes. Surely it should treat the parts between them as separate words?
What do I need to do to correct this?
Thanks
P.S. I'm using Lucene.NET but I really don't think it makes much of a difference with regards to the answers.
Comments (2)
The StandardAnalyzer, which uses the StandardTokenizer, doesn't tokenize URLs (although it recognises emails and treats them as one token). What you are seeing is its default behaviour - splitting on various punctuation characters. The simplest solution might be to write a custom Analyzer and supply a UrlTokenizer that extends or modifies the code in StandardTokenizer to tokenize URLs.
The UrlTokenizer would split on /, - and _, and whatever else you want. Nutch may also have some relevant code, but I don't know if there's a .NET version.
Note that if you have a distinct fieldName for URLs, the custom Analyzer can use the StandardTokenizer by default and the UrlTokenizer only for that field.
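To make the splitting rule concrete, here is a minimal, library-free sketch in Python. It is not Lucene code and not the answerer's implementation: `tokenize_url`, `tokenize_field`, the `url` field name and the delimiter set are all hypothetical names chosen for illustration, and the whitespace split is only a crude stand-in for StandardTokenizer. A real UrlTokenizer would apply the same rule inside a Lucene Tokenizer subclass.

```python
import re

# Assumed delimiter set: the answer names "/", "-" and "_";
# ":", "?", "&", "=" and "#" are extra characters added for this sketch.
URL_DELIMITERS = re.compile(r"[/\-_:?&=#]+")


def tokenize_url(url):
    """Split a URL on runs of delimiter characters and lowercase the pieces."""
    return [token.lower() for token in URL_DELIMITERS.split(url) if token]


def tokenize_field(field_name, text, url_fields=("url",)):
    """Pick a tokenizer per field, as a per-field Analyzer would.

    Fields listed in url_fields get the URL rule; everything else falls
    back to a whitespace split (a placeholder, not StandardTokenizer).
    """
    if field_name in url_fields:
        return tokenize_url(text)
    return text.lower().split()
```

With this rule the URL from the question yields `http`, `news.bbc.co.uk`, `sport1`, `hi`, `football`, `internationals` and `8196322.stm` as separate tokens. The dot is deliberately left out of the delimiter set so the host name stays in one piece; add it if you want `bbc` to match on its own.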
You should parse the URL yourself (I imagine there's at least one .NET class that can parse a URL string and tease out the different elements), then add those elements (such as the host, or whatever else you're interested in filtering on) as Keywords; don't analyze them at all.
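As a rough illustration of that approach, the sketch below uses Python's standard `urllib.parse.urlsplit` to pull the pieces apart; in .NET, `System.Uri` is probably the class to look at. The function name `url_keywords` and the dictionary keys are hypothetical. Each returned value would then be added to the document as its own un-analyzed (keyword) field.

```python
from urllib.parse import urlsplit


def url_keywords(url):
    """Break a URL into the parts you might index as exact-match keywords."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        # hostname is None when the URL has no network location
        "host": parts.hostname or "",
        # drop the empty strings produced by leading or doubled slashes
        "path_segments": [seg for seg in parts.path.split("/") if seg],
    }
```

Because the values are stored as-is, a search for the host `news.bbc.co.uk` or the path segment `football` only matches on the exact string, which is usually what you want when filtering by URL parts.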