Solr synonym replacement fails?

Posted on 2024-12-17 02:52:32

I have a SynonymFilterFactory using a synonym file. From the Solr documentation:

#Explicit mappings match any token sequence on the LHS of "=>"
#and replace with all alternatives on the RHS.  These types of mappings
#ignore the expand parameter in the schema.
#Examples:
i-pod, i pod => ipod,
sea biscuit, sea biscit => seabiscuit

However, when querying sea biscuit, I end up with results related to sea, biscuit and seabiscuit.

This is as if I had the following configuration (with expand="true"):

sea biscuit, sea biscit, seabiscuit

I don't understand this behavior, because in the Solr analysis tool, when querying sea biscuit it is properly replaced by seabiscuit only.

In other words: explicit synonym mapping with => doesn't work.
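
For reference, the difference between the two rule forms quoted above can be sketched in a few lines of Python. This is only an illustration of the documented semantics, not Solr's actual parser; `parse_rule` is a hypothetical helper name, and the `expand=False` branch assumes the documented behavior of reducing every term to the first one in the list.

```python
def parse_rule(line, expand=True):
    """Return {source phrase: [replacement phrases]} for one synonyms.txt line."""
    def split_terms(text):
        # Comma-separated phrases; empty entries (e.g. a trailing comma) are dropped.
        return [t.strip() for t in text.split(",") if t.strip()]

    if "=>" in line:
        # Explicit mapping: every LHS phrase is replaced by all RHS phrases.
        # The expand setting plays no role here.
        lhs, rhs = line.split("=>", 1)
        targets = split_terms(rhs)
        return {src: list(targets) for src in split_terms(lhs)}

    terms = split_terms(line)
    if expand:
        # Equivalent synonyms with expand=true: each term maps to every term.
        return {t: list(terms) for t in terms}
    # Equivalent synonyms with expand=false: each term maps to the first term.
    return {t: [terms[0]] for t in terms}


explicit = parse_rule("sea biscuit, sea biscit => seabiscuit", expand=True)
equivalent = parse_rule("sea biscuit, sea biscit, seabiscuit", expand=True)

print(explicit["sea biscuit"])    # only the RHS of the explicit mapping
print(equivalent["sea biscuit"])  # every term in the equivalence list
```

Under these assumptions, the explicit rule should yield only `seabiscuit` for `sea biscuit`, whatever `expand` is set to, which is the behavior the question expects.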


Edit: field configuration

Tokenized: true

Class Name: org.apache.solr.schema.TextField

Index Analyzer: org.apache.solr.analysis.TokenizerChain

  • Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

org.apache.solr.analysis.StopFilterFactory args:{enablePositionIncrements: true words: stopwords.txt ignoreCase: true }
org.apache.solr.analysis.WordDelimiterFilterFactory args:{preserveOriginal: 1 catenateWords: 1 catenateNumbers: 1 splitOnCaseChange: 1 catenateAll: 0 generateNumberParts: 1 generateWordParts: 1 }
org.apache.solr.analysis.LowerCaseFilterFactory args:{}
org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected: protwords.txt }
org.apache.solr.analysis.LengthFilterFactory args:{min: 2 max: 500 }
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{}
org.apache.solr.analysis.ASCIIFoldingFilterFactory args:{}

Query Analyzer: org.apache.solr.analysis.TokenizerChain

  • Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

org.apache.solr.analysis.LowerCaseFilterFactory args:{}
org.apache.solr.analysis.SynonymFilterFactory args:{expand: true ignoreCase: true synonyms: synonyms.txt }
org.apache.solr.analysis.StopFilterFactory args:{words: stopwords.txt ignoreCase: true }
org.apache.solr.analysis.WordDelimiterFilterFactory args:{preserveOriginal: 1 catenateWords: 0 catenateNumbers: 0 splitOnCaseChange: 1 catenateAll: 0 generateNumberParts: 1 generateWordParts: 1 }
org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected: protwords.txt }
org.apache.solr.analysis.LengthFilterFactory args:{min: 2 max: 500 }
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{}
org.apache.solr.analysis.ASCIIFoldingFilterFactory args:{}

Comments (2)

挥剑断情 2024-12-24 02:52:33

Are you doing a phrase query (using double quotes)?
If not, you are giving two separate tokens to the SynonymFilter (sea and biscuit), each analyzed on its own. In that case, no matching synonym is found.
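
A rough Python simulation of that effect, assuming the query parser splits an unquoted query on whitespace before handing each word to the analyzer (the analysis tool, by contrast, analyzes the whole string at once). The rule table and the `apply_synonyms` and `analyze` helpers are hypothetical stand-ins, not Solr code, and the other filters in the chain are ignored.

```python
# Multi-token rules from the synonyms file, keyed by token sequence.
SYNONYMS = {
    ("sea", "biscuit"): ["seabiscuit"],
    ("sea", "biscit"): ["seabiscuit"],
}


def apply_synonyms(tokens, rules=SYNONYMS):
    """Greedy longest-match replacement over a single token stream."""
    out, i = [], 0
    max_len = max(len(key) for key in rules)
    while i < len(tokens):
        # Try the longest candidate sequence first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in rules:
                out.extend(rules[key])
                i += n
                break
        else:
            # No rule starts at this position: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def analyze(text):
    # Stand-in for whitespace tokenizer + lowercase filter + synonym filter.
    return apply_synonyms(text.lower().split())


# Analysis tool or phrase query: the whole string is analyzed as one stream.
whole = analyze("sea biscuit")

# Unquoted query: each whitespace-separated word is analyzed on its own,
# so the synonym filter never sees "sea" and "biscuit" together.
per_word = [tok for word in "sea biscuit".split() for tok in analyze(word)]

print(whole)
print(per_word)
```

In this sketch the two-token rule only fires when both tokens reach the filter in the same stream, which would explain why the analysis tool shows the replacement while an unquoted query does not apply it.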

By the way, it's almost always a better idea to handle synonyms at index time. See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

丶视觉 2024-12-24 02:52:32

SynonymFilterFactory has been deprecated and should be replaced with SynonymGraphFilterFactory. It squashes tokens and fixes issues with multi-word synonyms when more than one token exists at the same position.
