How do I configure stemming in Solr?
I added "American" to my Solr index. When I search for "America", there are no results.
How should schema.xml be configured to get results?
Current configuration:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
</fieldType>
Why would you have two stemmers?
Try removing EnglishPorterFilterFactory (deprecated) from both of your analyzer types, rebuild the index, and then check whether a search for America now returns American. If that doesn't work, the other thing you can try is to remove both of your stemmer filters and add SnowballPorterFilterFactory with language="English" instead.
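The first suggestion (dropping EnglishPorterFilterFactory and keeping a single stemmer) might look roughly like the sketch below. It is the asker's own field type with only that one filter removed from each analyzer; everything else is left as posted, and the protected="protwords.txt" attribute disappears along with the removed filter. Treat it as a starting point, not a verified fix.

```xml
<!-- Sketch: the asker's "text" field type with the deprecated
     EnglishPorterFilterFactory removed, leaving one stemmer per analyzer. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <!-- the only remaining stemmer -->
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
</fieldType>
```

Documents indexed under the old analyzer chain keep their old tokens, so the index has to be rebuilt after the change, as the answer says.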
You have to use one stemmer per analyzer, and EnglishPorterFilterFactory is deprecated, as @Marko already mentioned, so you should remove it from your analyzers. I used SnowballPorterFilterFactory for both the index and query analyzers.
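The field type definition from this answer is not reproduced here. A minimal sketch consistent with its description (a whitespace tokenizer followed by the Snowball stemmer, in both analyzers) could look like the following. The field type name text_stemmed is hypothetical, and the LowerCaseFilterFactory step is an assumption added because the stemmer is normally fed lowercased tokens; neither comes from the answer itself.

```xml
<!-- Sketch only: "text_stemmed" is a hypothetical name. -->
<fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- split the text into tokens on whitespace -->
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- assumption: lowercase before stemming -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- the single stemmer for this analyzer -->
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
  <analyzer type="query">
    <!-- same chain at query time so both sides produce matching tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
</fieldType>
```

Whether "American" and "America" actually reduce to the same token depends on the stemming algorithm, so it is worth confirming with the Analysis screen in the Solr admin UI after reindexing.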
The fieldType definition is pretty self-explanatory, but just in case:
Tokenizer solr.WhitespaceTokenizerFactory: breaks the text into words (tokens), using whitespace as the delimiter.
Filter solr.SnowballPorterFilterFactory: applies a stemming algorithm to each word (token). In my configuration I chose the Snowball Porter stemming algorithm. Solr provides implementations of several popular stemming algorithms.
You can also look at other stemming filters, e.g. HunspellStemFilterFactory or KStemFilterFactory.
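Swapping in one of those alternatives is mostly a matter of changing the filter line. As a hedged illustration, here is a variant using KStemFilterFactory; the field type name text_kstem is hypothetical, and a single analyzer element without a type attribute is used so the same chain applies at both index and query time.

```xml
<!-- Sketch only: "text_kstem" is a hypothetical name. -->
<fieldType name="text_kstem" class="solr.TextField" positionIncrementGap="100">
  <!-- no type attribute: this analyzer is used for indexing and querying -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- lowercase before the stemmer sees the tokens -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- KStem is a less aggressive English stemmer than Porter -->
    <filter class="solr.KStemFilterFactory" />
  </analyzer>
</fieldType>
```

Different stemmers conflate different word pairs, so results for a given pair of terms should be compared after reindexing before settling on one.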