Solr synonym replacement failing?
I have a SynonymFilterFactory using a synonym file. From the Solr documentation:
#Explicit mappings match any token sequence on the LHS of "=>"
#and replace with all alternatives on the RHS. These types of mappings
#ignore the expand parameter in the schema.
#Examples:
i-pod, i pod => ipod,
sea biscuit, sea biscit => seabiscuit
However, when querying sea biscuit, I end up with results related to sea, biscuit and seabiscuit.
This is as if I had the following configuration (with expand="true"):
sea biscuit, sea biscit, seabiscuit
I don't understand this behavior, because in the Solr analysis tool, when querying sea biscuit, it is properly replaced by seabiscuit only.
In other words: explicit synonym mapping with => doesn't work.
Edit: field configuration
Tokenized: true
Class Name: org.apache.solr.schema.TextField
Index Analyzer: org.apache.solr.analysis.TokenizerChain
- Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
Filters:
org.apache.solr.analysis.StopFilterFactory args:{enablePositionIncrements: true words: stopwords.txt ignoreCase: true }
org.apache.solr.analysis.WordDelimiterFilterFactory args:{preserveOriginal: 1 catenateWords: 1 catenateNumbers: 1 splitOnCaseChange: 1 catenateAll: 0 generateNumberParts: 1 generateWordParts: 1 }
org.apache.solr.analysis.LowerCaseFilterFactory args:{}
org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected: protwords.txt }
org.apache.solr.analysis.LengthFilterFactory args:{min: 2 max: 500 }
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{}
org.apache.solr.analysis.ASCIIFoldingFilterFactory args:{}
Query Analyzer: org.apache.solr.analysis.TokenizerChain
- Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
Filters:
org.apache.solr.analysis.LowerCaseFilterFactory args:{}
org.apache.solr.analysis.SynonymFilterFactory args:{expand: true ignoreCase: true synonyms: synonyms.txt }
org.apache.solr.analysis.StopFilterFactory args:{words: stopwords.txt ignoreCase: true }
org.apache.solr.analysis.WordDelimiterFilterFactory args:{preserveOriginal: 1 catenateWords: 0 catenateNumbers: 0 splitOnCaseChange: 1 catenateAll: 0 generateNumberParts: 1 generateWordParts: 1 }
org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected: protwords.txt }
org.apache.solr.analysis.LengthFilterFactory args:{min: 2 max: 500 }
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{}
org.apache.solr.analysis.ASCIIFoldingFilterFactory args:{}
Comments (2)
Are you doing a phrase query (using double quotes)?
If not, the query parser splits the input on whitespace before analysis, so the SynonymFilter receives sea and biscuit as two separate tokens. In that case, no matching synonym is found.
By the way, it's almost always a better idea to handle synonyms at index time. See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
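To make the difference concrete, here is a toy Python model of the behavior described above. It is not Solr code: parse_rules, apply_synonyms, analyze_unquoted and analyze_phrase are hypothetical helpers written for illustration, and the sketch assumes a query parser that splits on whitespace before handing each term to the analyzer. It may also explain why the analysis tool shows the expected replacement: there the whole input passes through the analyzer as one token stream.

```python
from typing import Dict, List, Tuple

# Toy model only: these helpers are hypothetical and do not mirror Solr internals.
Rules = Dict[Tuple[str, ...], List[str]]


def parse_rules(lines: List[str]) -> Rules:
    """Parse explicit 'a, b => c' mappings into {lhs token tuple: [rhs terms]}."""
    rules: Rules = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=>" not in line:
            continue
        lhs, rhs = line.split("=>", 1)
        targets = [t.strip() for t in rhs.split(",") if t.strip()]
        for source in lhs.split(","):
            key = tuple(source.split())
            if key:
                rules[key] = targets
    return rules


def apply_synonyms(tokens: List[str], rules: Rules) -> List[str]:
    """Greedy longest-match replacement over a single token stream."""
    out: List[str] = []
    max_len = max((len(k) for k in rules), default=1)
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in rules:
                out.extend(rules[key])
                i += n
                break
        else:
            # No rule starts at this position: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def analyze_unquoted(query: str, rules: Rules) -> List[str]:
    """Assumed parser behavior: split on whitespace, analyze each term alone."""
    out: List[str] = []
    for term in query.split():
        out.extend(apply_synonyms([term.lower()], rules))
    return out


def analyze_phrase(query: str, rules: Rules) -> List[str]:
    """Quoted phrase (or analysis tool): the whole text is one token stream."""
    return apply_synonyms(query.lower().split(), rules)


RULES = parse_rules([
    "i-pod, i pod => ipod,",
    "sea biscuit, sea biscit => seabiscuit",
])

print(analyze_unquoted("sea biscuit", RULES))  # ['sea', 'biscuit']
print(analyze_phrase("sea biscuit", RULES))    # ['seabiscuit']
```

In this model the multi-word rule can only fire when both tokens reach the synonym step together, which is what a quoted phrase query provides.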
SynonymFilterFactory has been deprecated and should now be replaced with SynonymGraphFilterFactory, which produces a correct token graph and fixes the issues with multi-word synonyms that arise when more than one token exists at the same position.