PatternTokenizerFactory and stopwords
A document field in Solr/Lucene called COLORS holds groups of words like this:
field1: blue/dark red/green
field2: blue/yellow/orange
[...]
I need to run a faceted search over that field to get all the colors and the count of each color.
First I tried the PatternTokenizerFactory, followed by a stopword list:
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="/" />
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.TrimFilterFactory" />
  <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="stopwords"
          enablePositionIncrements="true"
  />
</analyzer>
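For context, here is a minimal sketch of how an analyzer like this might be wired into schema.xml so that faceting runs on its output. The field type name `color_facet` and the file name `stopwords.txt` are assumptions for illustration; the question itself only gives `words="stopwords"`. Faceting counts indexed terms, so whatever the analyzer emits at index time is what shows up as facet values.

```xml
<!-- Hypothetical field type; the name "color_facet" is not from the question -->
<fieldType name="color_facet" class="solr.TextField">
  <analyzer>
    <!-- Split only on "/", so "dark red" stays a single token -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="/"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Assumed file name: one stopword per line, in the core's conf directory -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

<!-- The COLORS field from the question, bound to the hypothetical type -->
<field name="COLORS" type="color_facet" indexed="true" stored="true"/>
```

A request with `facet=true&facet.field=COLORS` would then return one count per emitted token. Because the stop filter compares whole tokens, a stopword entry has to match the entire lowercased, trimmed token (for example `dark red`, not just `dark`).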
Unfortunately, the stopword list seems to be ignored: stopwords still show up in the faceted search results.
This SO question describes the same problem. Unfortunately, the posted solution doesn't work for me, because I cannot use solr.StandardTokenizerFactory: the standard tokenizer also splits tokens on whitespace. That means "dark red" becomes "dark" and "red", which is wrong.
Is there any way to use the pattern tokenizer?
Thank you for any kind of help!
Comments (1)
For your information: faceting, the pattern tokenizer, and stopwords will work together in Lucene/Solr 4 :-)