Comparison of Lucene Analyzers
Can someone please explain the difference between the different analyzers within Lucene? I am getting a maxClauseCount exception, and I understand that I can avoid this by using a KeywordAnalyzer, but I don't want to change from the StandardAnalyzer without understanding the issues surrounding analyzers. Thanks very much.
2 Answers
In general, any analyzer in Lucene is a tokenizer + stemmer + stop-words filter.
The tokenizer splits your text into chunks, and since different analyzers may use different tokenizers, you can get different output token streams, i.e. different sequences of chunks of text. For example, the KeywordAnalyzer you mentioned doesn't split the text at all and takes the whole field as a single token. At the same time, StandardAnalyzer (and most other analyzers) uses spaces and punctuation as split points. For example, for the phrase "I am very happy" it will produce the list ["i", "am", "very", "happy"] (or something like that). For more information on specific analyzers/tokenizers, see their Java Docs.
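
As a minimal sketch of how to see this for yourself (the AnalyzerDemo class and printTokens helper are made-up names for illustration, and it assumes a recent Lucene where StandardAnalyzer has a no-argument constructor; older versions require a Version argument):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalyzerDemo {
        // Prints every token the given analyzer emits for the text.
        static void printTokens(Analyzer analyzer, String text) throws IOException {
            try (TokenStream ts = analyzer.tokenStream("field", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.print("[" + term + "] ");
                }
                ts.end();
                System.out.println();
            }
        }

        public static void main(String[] args) throws IOException {
            String text = "I am very happy";
            printTokens(new StandardAnalyzer(), text); // [i] [am] [very] [happy]
            printTokens(new KeywordAnalyzer(), text);  // [I am very happy] as one token
        }
    }
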
Stemmers are used to get the base form of a word. This heavily depends on the language used. For example, for the previous phrase in English something like ["i", "be", "veri", "happi"] will be produced, and for the French "Je suis très heureux" some kind of French analyzer (like SnowballAnalyzer, initialized with "French") will produce ["je", "être", "tre", "heur"]. Of course, if you use an analyzer for one language to stem text in another, the rules of the first language will be applied, and the stemmer may produce incorrect results. The whole system doesn't fail, but search results may then be less accurate. KeywordAnalyzer doesn't use any stemmer; it passes the whole field through unmodified. So if you are going to search for words in English text, it isn't a good idea to use this analyzer.
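
The per-language stemming analyzers live in Lucene's analyzers module (SnowballAnalyzer itself was removed in later Lucene versions in favour of classes like these). Reusing the printTokens helper from the sketch above, and assuming lucene-analyzers-common is on the classpath:

    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;

    // EnglishAnalyzer applies Porter stemming after tokenizing; expect
    // something like [i] [am] [veri] [happi], depending on version and stop set.
    printTokens(new EnglishAnalyzer(), "I am very happy");

    // FrenchAnalyzer applies French stemming rules instead; feeding it
    // English text would stem with the wrong language's rules.
    printTokens(new FrenchAnalyzer(), "Je suis très heureux");
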
Stop words are the most frequent and almost useless words. Again, this heavily depends on the language. For English these words are "a", "the", "I", "be", "have", etc. Stop-word filters remove them from the token stream to lower the noise in search results, so finally our phrase "I'm very happy" with StandardAnalyzer will be transformed to a list like ["veri", "happi"], while KeywordAnalyzer again does nothing. So KeywordAnalyzer is used for things like IDs or phone numbers, but not for ordinary text.
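
If the default stop list doesn't suit you, most analyzers accept a custom stop set. A hedged sketch (the CharArraySet package location and constructor overloads vary across Lucene versions):

    import java.util.Arrays;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Build a custom stop set; 'true' means matching ignores case.
    CharArraySet stops = new CharArraySet(Arrays.asList("i", "am"), true);
    printTokens(new StandardAnalyzer(stops), "I am very happy"); // [very] [happy]
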
As for your maxClauseCount exception, I believe you get it while searching. In that case it is most probably caused by a search query that is too complex. Try splitting it into several queries or using more low-level functions.
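
If restructuring the query isn't enough, the clause limit itself can be raised. Where the setting lives has moved between Lucene versions, so treat this as a sketch:

    // Up to Lucene 7.x the limit is a static setting on BooleanQuery
    // (the default is 1024 clauses):
    org.apache.lucene.search.BooleanQuery.setMaxClauseCount(4096);

    // In Lucene 8+ the equivalent static setting lives on IndexSearcher:
    // org.apache.lucene.search.IndexSearcher.setMaxClauseCount(4096);
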
From my perspective, I have used StandardAnalyzer and SmartChineseAnalyzer, as I have to search text in Chinese. Obviously, SmartChineseAnalyzer is better at handling Chinese. For different purposes, you have to choose the most appropriate analyzer.
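
For reference, SmartChineseAnalyzer ships in a separate module (lucene-analyzers-smartcn). A minimal sketch, reusing the printTokens helper from the first answer; the example sentence is made up:

    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    String chinese = "我很高兴";  // "I am very happy"
    printTokens(new StandardAnalyzer(), chinese);     // splits roughly one token per character
    printTokens(new SmartChineseAnalyzer(), chinese); // segments into actual Chinese words
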