Lucene.NET: camel case tokenizer?


I've started playing with Lucene.NET today and I wrote a simple test method to do indexing and searching on source code files. The problem is that the standard analyzers/tokenizers treat the whole camel case source code identifier name as a single token.

I'm looking for a way to split camel case identifiers like MaxWidth into three tokens: maxwidth, max, and width. I've looked for such a tokenizer but couldn't find one. Before writing my own: does anything in this direction already exist? Or is there a better approach than writing a tokenizer from scratch?

UPDATE: in the end I decided to get my hands dirty and I wrote a CamelCaseTokenFilter myself. I'll write a post about it on my blog and I'll update the question.

3 Answers

我为君王 2024-09-26 06:12:44

Solr has a WordDelimiterFactory which generates a tokenizer similar to what you need. Maybe you can translate the source code into C#.
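
For reference, the filter behind that factory was later promoted from Solr into Lucene's analyzers-common module as WordDelimiterFilter. Here is a minimal Java sketch (assuming Lucene 6.x on the classpath; class locations and constructors vary across versions) showing the flags relevant to camel case:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WordDelimiterDemo {
    public static void main(String[] args) throws Exception {
        Tokenizer source = new WhitespaceTokenizer();
        source.setReader(new StringReader("MaxWidth"));

        int flags = WordDelimiterFilter.GENERATE_WORD_PARTS    // emit "Max" and "Width"
                  | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE   // split at lower-to-upper boundaries
                  | WordDelimiterFilter.PRESERVE_ORIGINAL;     // also keep "MaxWidth" itself

        TokenStream ts = new LowerCaseFilter(
                new WordDelimiterFilter(source, flags, null)); // null = no protected words

        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term);  // maxwidth, max, width
        }
        ts.end();
        ts.close();
    }
}

PRESERVE_ORIGINAL keeps the unsplit identifier in the stream as well, so exact-name queries still match.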

鹿港小镇 2024-09-26 06:12:44

The link below might be helpful for writing a custom tokenizer:

http://karticles.com/NoSql/lucene_custom_tokenizer.html

无人问我粥可暖 2024-09-26 06:12:44

Here is my implementation:

package corp.sap.research.indexing;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CamelCaseFilter extends TokenFilter {

    private final CharTermAttribute _termAtt;

    public CamelCaseFilter(TokenStream input) {
        super(input);
        this._termAtt = addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken())
            return false;
        // Rewrite the current term in place: "MaxWidth" becomes "Max Width".
        String split = splitCamelCase(_termAtt.toString());
        _termAtt.setEmpty();
        _termAtt.append(split);
        return true;
    }

    // Inserts a space at every camel-case boundary:
    //   upper followed by upper+lower:  "XMLFile"  -> "XML File"
    //   non-upper followed by upper:    "maxWidth" -> "max Width"
    //   letter followed by non-letter:  "Width2"   -> "Width 2"
    static String splitCamelCase(String s) {
        return s.replaceAll(
            String.format("%s|%s|%s",
                "(?<=[A-Z])(?=[A-Z][a-z])",
                "(?<=[^A-Z])(?=[A-Z])",
                "(?<=[A-Za-z])(?=[^A-Za-z])"
            ),
            " "
        );
    }
}
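
Note that splitCamelCase joins the parts with spaces inside a single token, so the stream still carries one token per input token ("MaxWidth" becomes the single term "Max Width"); emitting max and width as separate tokens would require buffering the parts and returning one per incrementToken() call. To use the filter, it still has to be wired into an analyzer chain. Here is a hypothetical wiring sketch (not from the original post), assuming the Lucene 3.x Analyzer API that matches the filter code above:

package corp.sap.research.indexing;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Hypothetical analyzer wiring for the filter above:
// whitespace-tokenize, split camel case, then lowercase.
public class CamelCaseAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new WhitespaceTokenizer(reader);
        stream = new CamelCaseFilter(stream);  // "MaxWidth"  -> "Max Width"
        stream = new LowerCaseFilter(stream);  // "Max Width" -> "max width"
        return stream;
    }
}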