What analyzer should I use for a URL in Lucene.NET?

Posted on 2024-08-12 23:18:55

I'm having problems getting a simple URL to tokenize properly so that it can be searched as expected.

I'm indexing "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm" with the StandardAnalyzer and it is tokenizing the string as the following (debug output):

(http,0,4,type=<ALPHANUM>)
(news.bbc.co.uk,7,21,type=<HOST>)
(sport1/hi,22,31,type=<NUM>)
(football,32,40,type=<ALPHANUM>)
(internationals/8196322.stm,41,67,type=<NUM>)

In general it looks good: http by itself, then the hostname. The issue seems to come with the forward slashes. Surely it should treat the parts between them as separate words?

What do I need to do to correct this?

Thanks

P.S. I'm using Lucene.NET but I really don't think it makes much of a difference with regards to the answers.

Comments (2)

强者自强 2024-08-19 23:18:55

The StandardAnalyzer, which uses the StandardTokenizer, doesn't tokenize URLs (although it recognises email addresses and treats them as one token). What you are seeing is its default behaviour: splitting on various punctuation characters. The simplest solution might be to write a custom Analyzer and supply a MyUrlTokenizer that extends or modifies the code in StandardTokenizer to tokenize URLs. Something like:

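import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

// Sketch against the pre-4.0 Lucene Java API (Analyzer.tokenStream).
// MyUrlTokenizer and SynonymFilter are classes you supply yourself.
public class MyAnalyzer extends Analyzer {

    public MyAnalyzer() {
        super();
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Custom tokenizer that also breaks on URL punctuation such as '/'.
        TokenStream result = new MyUrlTokenizer(reader);
        result = new LowerCaseFilter(result);
        // The real StopFilter constructor also takes a stop-word set;
        // its exact arguments vary by Lucene version.
        result = new StopFilter(result);
        result = new SynonymFilter(result);

        return result;
    }

}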

Here MyUrlTokenizer splits on /, -, _ and whatever else you want. Nutch may also have some relevant code, but I don't know if there's a .NET version.
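To make the splitting rule concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name UrlSplitSketch and the exact delimiter set are assumptions chosen for illustration; a real tokenizer would apply the same rule inside its own token loop and also record offsets. Dots are deliberately not treated as delimiters here, so the hostname stays in one piece.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: it shows only the splitting rule.
final class UrlSplitSketch {

    // Break the input on runs of '/', '-', '_' and ':' and lower-case each piece.
    static List<String> split(String url) {
        List<String> tokens = new ArrayList<>();
        for (String part : url.split("[/\\-_:]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [http, news.bbc.co.uk, sport1, hi, football, internationals, 8196322.stm]
        System.out.println(split(
            "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm"));
    }
}
```

Adding '.' to the character class would also split the hostname and the file extension, which may or may not be what you want for searching.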

Note that if you have a distinct fieldName for URLs, you can modify the above code to use the StandardTokenizer by default and MyUrlTokenizer only for that field.

e.g.

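public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result;
    if (fieldName.equals("url")) {
        // URL fields get the custom tokenizer.
        result = new MyUrlTokenizer(reader);
    } else {
        // Every other field keeps the default behaviour.
        result = new StandardTokenizer(reader);
    }
    // Apply the same filters as before (lower-casing, stop words, ...) here.
    return result;
}

初见终念 2024-08-19 23:18:55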

You should parse the URL yourself (I imagine there's at least one .NET class that can parse a URL string and tease out the different elements), then add those elements (such as the host, or whatever else you're interested in filtering on) as Keywords; don't Analyze them at all.
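As a rough illustration of that idea, the sketch below pulls the host and path segments out of a URL with Java's standard java.net.URI, to match the Java snippets elsewhere in this thread. In .NET, System.Uri plays a similar role. The class name UrlFieldsSketch and its two methods are hypothetical, and the step of adding each value to the index as an untokenized field is left out because the field flags differ between Lucene versions.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: extracts the URL parts you might index as exact terms.
final class UrlFieldsSketch {

    // The host, e.g. "news.bbc.co.uk"; null if the URL has no host part.
    static String host(String url) {
        return URI.create(url).getHost();
    }

    // The non-empty pieces of the path, in order.
    static List<String> pathSegments(String url) {
        List<String> segments = new ArrayList<>();
        String path = URI.create(url).getPath();
        if (path == null) {
            return segments; // opaque URIs such as mailto: have no path
        }
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        String url = "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm";
        System.out.println(host(url));
        System.out.println(pathSegments(url));
    }
}
```

Each returned value can then be stored in its own field (for example a host field and a repeated path field) so that queries match it exactly rather than through an analyzer.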
