Configure Lucene.Net to recognise homonyms
We have a site where users can enter the name of a city. Lucene.Net 2.1.0.3 is the search engine used to look up cities that have already been created. As configured, Lucene does not recognise that Saint Jerome is the same as St. Jerome, or that Lake Phillip is the same as Lac Phillip.
Any tips on widening the search strategy for Lucene.Net?
Comments (1)
I've read a bit about synonym handling and "sounds like" matching, though I currently have no hands-on experience with either. To me these look like two different problems: abbreviation "synonyms" and "sounds like".
Sounds Like
Soundex is an older algorithm that was designed for misspellings of "American" names. An improved algorithm called 'Double Metaphone' addresses some of the complaints about Soundex. This library looks promising:
http://sourceforge.net/projects/phonetixnet/
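To make the "sounds like" idea concrete, here is a minimal sketch of classic American Soundex in plain Python. It is not the phonetixnet library and not a Lucene.Net API; the function name `soundex` and the table `CODES` are just names chosen for this illustration. In a real setup the phonetic code would presumably be stored in an extra indexed field and computed the same way for the query.

```python
# Classic American Soundex: first letter plus three digits.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """Return the 4-character Soundex code of `word` ("" if it has no A-Z letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(ch, "")  # vowels (and Y) have no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]


print(soundex("Lake"), soundex("Lac"))   # both L200
print(soundex("Saint"), soundex("St"))   # S530 and S300
```

Note what this shows: "Lake" and "Lac" get the same code, so a phonetic field would match them, but "Saint" and "St" do not, which is why abbreviations are better treated as a separate synonym problem.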
Abbreviation Synonyms
While a generic synonym system could work in principle, I would expect it to give "Garden City" synonyms such as "Plot Town" or "Patch burg". I am guessing you'll achieve better results with your own domain-specific synonyms.
Words like 'Saint' ('St.') and 'Mount' ('Mt') would probably be best handled as synonyms. Here is an article that proposes a fairly simple solution for custom synonyms: http://www.codeproject.com/KB/cs/lucene_custom_analyzer.aspx .
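As a rough illustration of the domain-specific approach, the sketch below maps each known variant to one canonical token before comparison. It is plain Python rather than the analyzer from the linked article, and the table `CANONICAL` and the function `normalize` are hypothetical names with made-up entries; you would fill the table from your own city data. In Lucene.Net the same mapping would presumably live in a custom analyzer applied at both index time and query time.

```python
import re

# Hypothetical domain-specific table: variant token -> canonical token.
CANONICAL = {
    "st": "saint",
    "mt": "mount",
    "lac": "lake",
}


def normalize(name):
    """Lower-case the name, drop punctuation, and replace known variants."""
    tokens = re.findall(r"\w+", name.lower())
    return " ".join(CANONICAL.get(tok, tok) for tok in tokens)


print(normalize("St. Jerome"))    # saint jerome
print(normalize("Saint Jerome"))  # saint jerome
print(normalize("Lac Phillip"))   # lake phillip
```

Because both the stored city names and the user's input pass through the same function, "St. Jerome" and "Saint Jerome" end up as the same searchable string. The trade-off is that the table has to be maintained by hand, and an entry such as "st" would be wrong in a domain where it can also mean "street".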