
Using pre-tokenized text with Lucene

Posted 2025-01-05 16:32:16

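My data has already been tokenized by an external resource, and I would like to use it in Lucene. My first idea was to join the tokens with \x01 and split them again with WhitespaceTokenizer. Is there a better approach? (The input is in XML format.)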

As a bonus, the annotated data also contains synonyms (represented as XML tags). How would I inject them?


咋地 2025-01-12 16:32:17

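Lucene allows you to supply your own token stream to a field, bypassing the tokenization step. To do so, create your own subclass of TokenStream that implements incrementToken(), then call field.setTokenStream(new MyTokenStream(yourTokens)):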

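import java.io.IOException;
import java.util.Iterator;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// MyToken is your own token type, exposing getText(), begin() and end().
// The class is final because Lucene asserts that TokenStream subclasses
// (or their incrementToken() method) are final.
public final class MyTokenStream extends TokenStream {
    private final CharTermAttribute charTermAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final Iterator<MyToken> listOfTokens;

    MyTokenStream(Iterator<MyToken> tokenList) {
        listOfTokens = tokenList;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!listOfTokens.hasNext()) {
            return false; // no more pre-tokenized input
        }
        clearAttributes(); // reset state left over from the previous token
        MyToken myToken = listOfTokens.next();
        charTermAtt.setEmpty().append(myToken.getText());
        offsetAtt.setOffset(myToken.begin(), myToken.end());
        return true;
    }
}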
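For the synonym part of the question, the usual Lucene convention is that a synonym is emitted as an extra token with a position increment of 0, so it occupies the same position as the word it annotates. The following is only a minimal, Lucene-free sketch of that ordering logic; AnnotatedToken, PositionedTerm and SynonymFlattener are hypothetical names invented for illustration, and how the synonyms are read out of the XML is left open. In a real stream, each emitted pair would correspond to one incrementToken() call that sets the term and calls PositionIncrementAttribute.setPositionIncrement(...).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model: one pre-tokenized word plus the synonyms annotated on it. */
final class AnnotatedToken {
    final String text;
    final List<String> synonyms;

    AnnotatedToken(String text, String... synonyms) {
        this.text = text;
        this.synonyms = Arrays.asList(synonyms);
    }
}

/** Hypothetical model: a term together with the position increment it would carry. */
final class PositionedTerm {
    final String term;
    final int positionIncrement;

    PositionedTerm(String term, int positionIncrement) {
        this.term = term;
        this.positionIncrement = positionIncrement;
    }

    @Override
    public String toString() {
        return term + "(+" + positionIncrement + ")";
    }
}

final class SynonymFlattener {
    /** Flattens annotated tokens into the order a token stream would emit them. */
    static List<PositionedTerm> flatten(List<AnnotatedToken> tokens) {
        List<PositionedTerm> out = new ArrayList<>();
        for (AnnotatedToken token : tokens) {
            // The original word advances to the next position.
            out.add(new PositionedTerm(token.text, 1));
            // Each synonym is stacked on the same position (increment 0).
            for (String synonym : token.synonyms) {
                out.add(new PositionedTerm(synonym, 0));
            }
        }
        return out;
    }
}
```

With the tokens "quick" (synonym "fast") and "fox", this yields quick(+1), fast(+0), fox(+1). A synonym token would normally reuse the start and end offsets of the word it stands in for.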

