I've been studying soundex, metaphone and other string search techniques the past few days, and in my understanding both algorithms work well in handling non-English words transliterated to English.
However, my requirement is for such a search to work in the original, untransliterated languages, accommodating alphabets such as German, Norwegian, and even Cyrillic.
Are there any search algorithms capable of handling these alphabets completely? Or am I better off using a third-party full-text search library such as Lucene? In that case, the question becomes 'does Lucene handle non-English alphabets?'
Comments (2)
I'm not an expert in this area, but your requirements seem quite difficult to me. Soundex was specifically designed for English sounds as well as characters. I don't think it will perform well for non-English languages. See for example the responses to this related question.
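To make the limitation concrete, here is a minimal sketch of classic American Soundex in Python. The choice to silently discard anything outside A-Z is my own assumption about how a naive implementation behaves; real libraries may raise an error or transliterate first. The function name `soundex` is just for illustration.

```python
# Letter-to-digit table of classic American Soundex.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word: str) -> str:
    # Assumption: only the 26 ASCII letters are kept; everything else is dropped.
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""  # nothing left to encode
    first = letters[0]
    out = [first]
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not break a run of identical codes
        code = CODES.get(c, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    # Truncate or zero-pad to one letter plus three digits.
    return "".join(out)[:4].ljust(4, "0")


print(soundex("Robert"), soundex("Rupert"))   # similar English names share a code
print(soundex("Orsted"), soundex("Ørsted"))   # the Ø is lost, so the codes differ
print(repr(soundex("Иванов")))                # Cyrillic input leaves nothing to encode
```

With this sketch, every Cyrillic word collapses to the same empty code, and a Norwegian name changes its code depending on whether it is spelled with `Ø` or `O`, which is roughly the failure mode I would expect from any algorithm built around the English alphabet.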
Double-Metaphone is an attempt to deal with much more complex variations than Soundex or Metaphone, and was designed to handle irregularities in a range of languages. It might be sufficient for your needs. There is a list of library implementations on the linked page.
Support for other languages in Lucene is based on the concept of Analyzers. Lucene comes with a set of analyzers for different languages (although I couldn't find the default list), but the quality may be quite variable.
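As a rough illustration of the kind of work an analyzer does (splitting text into tokens, then normalizing each one), here is a small Python sketch using only the standard library. This is not Lucene's API; the names `fold` and `analyze` are hypothetical, and the accent-stripping rule is a crude assumption rather than what any particular Lucene analyzer does.

```python
import re
import unicodedata
from typing import List


def fold(token: str) -> str:
    # casefold() is a more aggressive lower(); for example German ß becomes ss.
    decomposed = unicodedata.normalize("NFKD", token.casefold())
    # Drop the combining marks (umlauts, rings, accents) exposed by decomposition.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)


def analyze(text: str) -> List[str]:
    # Compose first so accented letters are single code points for the tokenizer.
    text = unicodedata.normalize("NFC", text)
    # For str patterns, \w is Unicode-aware, so Latin and Cyrillic letters both match.
    return [fold(tok) for tok in re.findall(r"\w+", text)]


print(analyze("Müller Straße"))
print(analyze("Ångström, Ørsted"))
print(analyze("Москва"))
```

Note that this folding is deliberately naive: letters such as `ø` have no decomposition and pass through unchanged, while stripping marks also turns Cyrillic `й` into `и`. Handling those cases properly is the sort of language-specific decision a per-language analyzer exists to make.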
There are some good references on Wikipedia, starting from the Soundex article. I don't know whether there are existing libraries designed to handle such a wide variety of languages.