Comparison of Lucene Analyzers
Can someone please explain the difference between the different analyzers within Lucene? I am getting a maxClauseCount exception, and I understand that I can avoid this by using a KeywordAnalyzer, but I don't want to change from the StandardAnalyzer without understanding the issues surrounding analyzers. Thanks very much.
2 Answers
In general, any analyzer in Lucene is a tokenizer + stemmer + stop-words filter.
The tokenizer splits your text into chunks, and since different analyzers may use different tokenizers, you can get different output token streams, i.e. different sequences of chunks of text. For example, the KeywordAnalyzer you mentioned doesn't split the text at all and takes the whole field as a single token. At the same time, StandardAnalyzer (and most other analyzers) uses spaces and punctuation as split points. For example, for the phrase "I am very happy" it will produce the list ["i", "am", "very", "happy"] (or something like that). For more information on specific analyzers/tokenizers, see their Java Docs.
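
As a minimal sketch of how to see this for yourself (the AnalyzerDemo class and printTokens helper are made-up names for illustration, and it assumes a recent Lucene where StandardAnalyzer has a no-argument constructor; older versions require a Version argument):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalyzerDemo {
        // Prints every token the given analyzer emits for the text.
        static void printTokens(Analyzer analyzer, String text) throws IOException {
            try (TokenStream ts = analyzer.tokenStream("field", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.print("[" + term + "] ");
                }
                ts.end();
                System.out.println();
            }
        }

        public static void main(String[] args) throws IOException {
            String text = "I am very happy";
            printTokens(new StandardAnalyzer(), text); // [i] [am] [very] [happy]
            printTokens(new KeywordAnalyzer(), text);  // [I am very happy] as one token
        }
    }
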
Stemmers are used to get the base form of a word. This heavily depends on the language used. For example, for the previous phrase in English something like ["i", "be", "veri", "happi"] will be produced, and for the French "Je suis très heureux" some kind of French analyzer (like SnowballAnalyzer, initialized with "French") will produce ["je", "être", "tre", "heur"]. Of course, if you use an analyzer for one language to stem text in another, the rules of the first language will be applied, and the stemmer may produce incorrect results. The whole system doesn't fail, but search results may then be less accurate. KeywordAnalyzer doesn't use any stemmer; it passes the whole field through unmodified. So if you are going to search for words in English text, it isn't a good idea to use this analyzer.
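
The per-language stemming analyzers live in Lucene's analyzers module (SnowballAnalyzer itself was removed in later Lucene versions in favour of classes like these). Reusing the printTokens helper from the sketch above, and assuming lucene-analyzers-common is on the classpath:

    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;

    // EnglishAnalyzer applies Porter stemming after tokenizing; expect
    // something like [i] [am] [veri] [happi], depending on version and stop set.
    printTokens(new EnglishAnalyzer(), "I am very happy");

    // FrenchAnalyzer applies French stemming rules instead; feeding it
    // English text would stem with the wrong language's rules.
    printTokens(new FrenchAnalyzer(), "Je suis très heureux");
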
Stop words are the most frequent and almost useless words. Again, this heavily depends on the language. For English these words are "a", "the", "I", "be", "have", etc. Stop-word filters remove them from the token stream to lower the noise in search results, so finally our phrase "I'm very happy" with StandardAnalyzer will be transformed to a list like ["veri", "happi"], while KeywordAnalyzer again does nothing. So KeywordAnalyzer is used for things like IDs or phone numbers, but not for ordinary text.
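
If the default stop list doesn't suit you, most analyzers accept a custom stop set. A hedged sketch (the CharArraySet package location and constructor overloads vary across Lucene versions):

    import java.util.Arrays;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Build a custom stop set; 'true' means matching ignores case.
    CharArraySet stops = new CharArraySet(Arrays.asList("i", "am"), true);
    printTokens(new StandardAnalyzer(stops), "I am very happy"); // [very] [happy]
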
As for your maxClauseCount exception, I believe you get it while searching. In that case it is most probably caused by a search query that is too complex. Try splitting it into several queries or using more low-level functions.
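
If restructuring the query isn't enough, the clause limit itself can be raised. Where the setting lives has moved between Lucene versions, so treat this as a sketch:

    // Up to Lucene 7.x the limit is a static setting on BooleanQuery
    // (the default is 1024 clauses):
    org.apache.lucene.search.BooleanQuery.setMaxClauseCount(4096);

    // In Lucene 8+ the equivalent static setting lives on IndexSearcher:
    // org.apache.lucene.search.IndexSearcher.setMaxClauseCount(4096);
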
From my perspective, I have used StandardAnalyzer and SmartChineseAnalyzer, as I have to search text in Chinese. Obviously, SmartChineseAnalyzer is better at handling Chinese. For different purposes, you have to choose the most appropriate analyzer.
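
For reference, SmartChineseAnalyzer ships in a separate module (lucene-analyzers-smartcn). A minimal sketch, reusing the printTokens helper from the first answer; the example sentence is made up:

    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    String chinese = "我很高兴";  // "I am very happy"
    printTokens(new StandardAnalyzer(), chinese);     // splits roughly one token per character
    printTokens(new SmartChineseAnalyzer(), chinese); // segments into actual Chinese words
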