How can I use a Lucene Analyzer to tokenize a String?

Posted on 2024-11-15 06:09:37

Is there a simple way I could use any subclass of Lucene's Analyzer to parse/tokenize a String?

Something like:

String to_be_parsed = "car window seven";
Analyzer analyzer = new StandardAnalyzer(...);
List<String> tokenized_string = analyzer.analyze(to_be_parsed);

Comments (4)

大海や 2024-11-22 06:09:37

Based on the answer above, this version is slightly modified to work with Lucene 4.0.

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class LuceneUtil {

  private LuceneUtil() {}

  public static List<String> tokenizeString(Analyzer analyzer, String string) {
    List<String> result = new ArrayList<String>();
    try {
      TokenStream stream = analyzer.tokenStream(null, new StringReader(string));
      stream.reset();
      while (stream.incrementToken()) {
        result.add(stream.getAttribute(CharTermAttribute.class).toString());
      }
      stream.end();    // record final state (e.g. the end offset)
      stream.close();  // release the stream so the analyzer can be reused
    } catch (IOException e) {
      // not thrown b/c we're using a string reader...
      throw new RuntimeException(e);
    }
    return result;
  }

}
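
For reference, here is a minimal sketch of a call site, assuming Lucene 4.x, where StandardAnalyzer takes a Version argument (org.apache.lucene.util.Version):

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
List<String> tokens = LuceneUtil.tokenizeString(analyzer, "car window seven");
// tokens -> [car, window, seven]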
南街九尾狐 2024-11-22 06:09:37

As far as I know, you have to write the loop yourself. Something like this (taken straight from my source tree):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class LuceneUtils {

    public static List<String> parseKeywords(Analyzer analyzer, String field, String keywords) {

        List<String> result = new ArrayList<String>();
        TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords));

        try {
            while (stream.incrementToken()) {
                result.add(stream.getAttribute(TermAttribute.class).term());
            }
        }
        catch (IOException e) {
            // not thrown b/c we're using a string reader...
        }

        return result;
    }
}
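
Note that TermAttribute is the pre-4.0 API: it was deprecated in the 3.x line and removed in Lucene 4.0 in favor of CharTermAttribute, so on newer versions the line inside the loop would become:

result.add(stream.getAttribute(CharTermAttribute.class).toString());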
嗳卜坏 2024-11-22 06:09:37

The latest best practice, as another Stack Overflow answer indicates, seems to be to add an attribute to the token stream and later access that attribute, rather than getting an attribute directly from the token stream. And for good measure you can make sure the analyzer gets closed. Using the very latest Lucene (currently v8.6.2), the code would look like this:

String text = "foo bar";
String fieldName = "myField";
List<String> tokens = new ArrayList<>();
try (Analyzer analyzer = new StandardAnalyzer()) {
  try (final TokenStream tokenStream = analyzer.tokenStream(fieldName, text)) {
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while(tokenStream.incrementToken()) {
      tokens.add(charTermAttribute.toString());
    }
    tokenStream.end();
  }
}

After that code is finished, tokens will contain a list of parsed tokens.
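
For instance, with the "foo bar" input above, the list would simply hold the lowercased terms:

System.out.println(tokens);  // prints: [foo, bar]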

See also: Lucene Analysis Overview.

Caveat: I'm just starting to write Lucene code, so I don't have a lot of Lucene experience. I have taken the time to research the latest documentation and related posts, however, and I believe that the code I've placed here follows the latest recommended practices slightly better than the current answers.

无声无音无过去 2024-11-22 06:09:37

Even better, use try-with-resources! This way you don't have to explicitly call the .close() that is required in higher versions of the library.

public static List<String> tokenizeString(Analyzer analyzer, String string) {
  List<String> tokens = new ArrayList<>();
  try (TokenStream tokenStream  = analyzer.tokenStream(null, new StringReader(string))) {
    tokenStream.reset();  // required
    while (tokenStream.incrementToken()) {
      tokens.add(tokenStream.getAttribute(CharTermAttribute.class).toString());
    }
  } catch (IOException e) {
    throw new RuntimeException(e);  // Shouldn't happen...
  }
  return tokens;
}

And the Tokenizer version:

  try (Tokenizer standardTokenizer = new HMMChineseTokenizer()) {
    standardTokenizer.setReader(new StringReader("我说汉语说得很好"));
    standardTokenizer.reset();
    while (standardTokenizer.incrementToken()) {
      // do something with each token; the original snippet dropped the value
      System.out.println(standardTokenizer.getAttribute(CharTermAttribute.class).toString());
    }
  } catch (IOException e) {
    throw new RuntimeException(e);  // Shouldn't happen...
  }
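
One dependency note: HMMChineseTokenizer lives in the lucene-analyzers-smartcn module rather than lucene-core, so that artifact is assumed on the classpath. Here is a sketch of the same loop that collects the tokens into a list instead of printing them, mirroring tokenizeString above:

  List<String> tokens = new ArrayList<>();
  try (Tokenizer tokenizer = new HMMChineseTokenizer()) {
    tokenizer.setReader(new StringReader("我说汉语说得很好"));
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      tokens.add(tokenizer.getAttribute(CharTermAttribute.class).toString());
    }
    tokenizer.end();
  } catch (IOException e) {
    throw new RuntimeException(e);  // Shouldn't happen with an in-memory reader
  }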