How do I use a tokenizer between filters in Solr?
I want to use a schema in which the whitespace tokenizer is called after one filter, and all other filters are applied after that:
<filter class="solr.SynonymFilterFactory" tokenizerFactory="solr.KeywordTokenizerFactory" synonyms="german/synonyms.txt" ignoreCase="true" expand="true"/>
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
splitOnCaseChange="1"
preserveOriginal="1"
/>
Solr only applies the new order among the filters; the tokenizer is still called before every filter...
Does anybody have a clue?
Best regards, hijolan
Comments (1)
Running the tokenizer before any filter is the default. That is simply how Solr works: the position of the tokenizer element relative to the filter elements does not change it. You can, however, add a special kind of filter that runs before the tokenizer, for example solr.MappingCharFilterFactory.
What I'm trying to say is that whether a filter can work before the tokenizer depends on the kind of filter. Char filters operate on the raw character stream before it is tokenized. Look at the CharFilter section: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories
If you need to "filter" the terms in a more complex way than solr.WhitespaceTokenizerFactory does, try a different tokenizer, such as solr.PatternTokenizerFactory.
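To make the ordering concrete, here is a minimal sketch of an analyzer laid out in the order Solr actually runs it: char filters, then the single tokenizer, then token filters. The field type name text_sketch and the mapping file mapping-example.txt are hypothetical placeholders, and the filter attributes are simply copied from the question; this illustrates the layout and is not a tested replacement for the original schema.

```xml
<!-- Hypothetical field type; the name and mapping file are placeholders. -->
<fieldType name="text_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- 1. Char filters see the raw character stream, before any tokenizing. -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-example.txt"/>

    <!-- 2. Exactly one tokenizer per analyzer. It runs before all token filters,
         wherever it appears among them in the XML. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Alternative from the answer: split on a regular expression instead, e.g.
         <tokenizer class="solr.PatternTokenizerFactory" pattern="\s+"/> -->

    <!-- 3. Token filters run on the tokenizer's output, in the order listed. -->
    <filter class="solr.SynonymFilterFactory"
            tokenizerFactory="solr.KeywordTokenizerFactory"
            synonyms="german/synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

If the rewriting really has to happen before the text is split on whitespace, the mapping rules in the char filter are the place to express it, since that is the only stage that sees the untokenized input.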