Numeric hash for comparing lexical similarity
Is there some form of hashing algorithm that produces similar numerical values for similar words? I imagine there would be a number of false positives, but it seems like something that could be useful for search pruning.
EDIT: Soundex is neat and may come in handy, but ideally I want something that behaves like this: abs(f('horse') - f('hoarse')) < abs(f('horse') - f('goat'))
Comments (3)
The Soundex algorithm generates a short key string corresponding to the phonemes in the input word. http://www.archives.gov/research/census/soundex.html
If you only want to compare similarity between strings, try Levenshtein distance. http://en.wikipedia.org/wiki/Levenshtein_distance
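To make the Levenshtein suggestion concrete, here is a minimal sketch of the standard dynamic-programming formulation. The function name `levenshtein` is just an illustrative choice, not taken from any library. Note that it compares two strings directly rather than hashing each word to a number, so it does not give you an f(word) on its own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("horse", "hoarse"))  # 1
print(levenshtein("horse", "goat"))    # 4
```

With this measure, 'horse' is closer to 'hoarse' than to 'goat', which is the ordering the question asks for, but each comparison costs O(len(a) * len(b)), so it is better suited to ranking a pruned candidate list than to pruning itself.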
What you are talking about is called Locality-sensitive Hashing. It can be applied to different types of input (images, music, text, positions in space, whatever you need).
Unfortunately (and despite searching) I couldn't find any practical implementation of an LSH algorithm for strings.
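For what it's worth, one common way to get LSH-like behavior for strings is MinHash over character n-grams. The following is only a rough sketch under my own assumptions, not a reference implementation: the function names, the bigram size, the 64 hash functions, and the use of salted `hashlib.blake2b` as the hash family are all illustrative choices.

```python
import hashlib


def shingles(word, n=2):
    """Set of character n-grams, with ^ and $ marking the word boundaries."""
    padded = f"^{word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def minhash_signature(word, num_hashes=64, n=2):
    """One minimum per salted hash function; similar n-gram sets tend
    to agree in more positions."""
    grams = shingles(word, n)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt = a different hash
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def band_keys(signature, rows_per_band=2):
    """Split the signature into bands; words sharing any band key are
    candidate matches (this is the bucketing step of LSH)."""
    return {
        (i, tuple(signature[i:i + rows_per_band]))
        for i in range(0, len(signature), rows_per_band)
    }


horse, hoarse, goat = (minhash_signature(w) for w in ("horse", "hoarse", "goat"))
print(estimated_similarity(horse, hoarse) > estimated_similarity(horse, goat))
print(bool(band_keys(horse) & band_keys(hoarse)))
```

The result is a set of bucket keys per word rather than a single number, so it does not give abs(f(a) - f(b)) directly, but storing each word under its band keys lets a search look only at words that share at least one bucket.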
You could always try Soundex and see if it fits your needs.
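If you want to try it quickly, here is a minimal sketch of the classic American Soundex rules (first letter kept, consonants mapped to digits, vowels dropped, result padded to four characters). The `soundex` function and `CODES` table are written for illustration, not taken from a library, and edge-case handling may differ from other implementations.

```python
# Consonant groups and their Soundex digits; vowels, h, w, y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return a 4-character key such as 'R163' for 'Robert'."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (first.upper() + "".join(digits) + "000")[:4]


print(soundex("horse"), soundex("hoarse"), soundex("goat"))  # H620 H620 G300
```

Here 'horse' and 'hoarse' collapse to the same key while 'goat' gets a different one, so Soundex works as an exact-match bucket for pruning, but the keys are not numbers whose difference measures how similar two words are.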