Solr ngram matching disaster
This is my (pretty standard) ngram schema:
<fieldType name="ngram" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Now the query laptop_ngram:"g74sx-a" returns:
<arr name="laptop_ngram">
<str>ASUS G74SX-A1 17.3-Inch Gaming Laptop</str>
</arr>
but laptop_ngram:"g74sx-a1" finds nothing.
BTW, escaping the "-" makes no difference.
Any thoughts?
Comments (2)
StandardTokenizerFactory might be altering the term, for example by splitting it at the hyphen. You can check this on the Solr analysis page.
If so, switching to WhitespaceTokenizerFactory could fix the problem.
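A minimal sketch of that suggestion, assuming the field type from the question with only the tokenizer swapped in both analyzers. The name ngram_ws is hypothetical, and the fragment has not been tested against any particular Solr version:

```xml
<!-- Hypothetical variant of the question's "ngram" type; only the tokenizer differs. -->
<fieldType name="ngram_ws" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
  <analyzer type="index">
    <!-- Splits on whitespace only, so "G74SX-A1" stays a single token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same tokenizer at query time, so the query term is not split at the hyphen either. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain, "g74sx-a1" is indexed as one 8-character token, so its grams include "g74sx-a1" itself and the query term can match a gram directly. The trade-off is that punctuation stays attached to tokens, so "17.3-Inch" is also kept whole.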
Thanks to O. Klein, who pointed me in a new direction.
I finally settled on WhitespaceTokenizerFactory plus WordDelimiterFilterFactory,
which works for "g74sx", "g74sx-", "g74sx-a", and "g74sx-a1".
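The exact configuration is not shown in this reply, so the following is only a sketch of what such a combination might look like. The name ngram_wdf, the filter order, and every WordDelimiterFilterFactory attribute value are assumptions, not the poster's actual settings:

```xml
<!-- Hypothetical field type; attribute values are illustrative guesses. -->
<fieldType name="ngram_wdf" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Splits on delimiters such as "-" and, with preserveOriginal="1", also keeps the unsplit token. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Assumed to mirror the index-side splitting, minus the n-gram step. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Which queries match depends heavily on these attribute values and on how the filter splits at letter/digit boundaries, so running both the indexed value and each query through the analysis page is the most reliable way to compare the tokens produced at each stage.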
However, the journey doesn't end here, as I'm still trying to work out why
"G74SX-XA1" is found with "g74sx-x" and "g74sx-xa1", but not with "g74sx-xa"...