单词的 Damerau-Levenshtein 距离
我正在寻找这样一种算法,但它可以在单词之间而不是字母之间进行替换。有这样的算法吗?
我正在寻找 SQL Server 的实现,但算法的名称就足够了。
I am looking for such an algorithm, but one that makes substitutions between words and not between letters. Is there such an algorithm?
I am looking for an implementation with SQL Server, but the name of the algorithm will be good enough.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
据我所知,没有理由不能只使用现有的 Levenshtein 算法 - 只需使用单词作为符号而不是字母。
As far as I'm aware, there's no reason you can't just use the existing Levenshtein algorithm - just use words as symbols instead of letters.
如果您需要在重新排列单词时查找拼写错误,我发现取单词用最少的空格并将其分割数组重新排列成以不同顺序重新排列的相同单词的字符串,然后对新单词运行 Levenshtein,第二个短语效果很好
(2 个空格,6 个排列) !
哈佛大学医学院(3 个空格,24 种排列)
编辑距离 - 17
与“斯坦福医学院”(Levenshtein 距离 5)相比,
哈华医学院与哈沃医学院的 LD 为 6。仍然允许斯坦福大学出现错误,但排名更接近,因此您可以建立像删除单词这样的事情(在这种情况下删除“of”只会得到 3 的 LD)
按空间排列单词的性能是 O(n!),删除单词是 O(n+m),其中 n,m 是每个单词中的单词数短语,除非您想删除多个单词,反之亦然,否则您只能选择删除少于 4 个字母的单词,或介词/形容词列表中的单词等。Levenshtein
的性能在其维基百科页面上进行了解释。
If you need to find misspellings while re-arranging words, I find that taking the word with least spaces and permuting its split array back into strings of rearranged same words in different order, then running Levenshtein on the new words and the second phrase works great!
Harward Medical School (2 spaces, 6 permutations)
Harward School of Medicine (3 spaces, 24 permutations)
Levenshtein distance - 17
Compare to, say, "Stanford School of Medicine" (Levenshtein distance 5)
Harward School Medical vs. Harward School of Medicine has LD of 6. Still allowing a mistake to Stanford, but MUCH MUCH CLOSER up the rank, so you can build in things like dropping words (dropping "of" in this case gets LD of only 3)
Performance of permuting words by space is O(n!), dropping words is O(n+m) where n,m are numbers of words in each phrase, unless you want to drop more than one word, or vice versa, you can only opt to drop words with fewer than 4 letters, or words from preposition/adjective lists, etc.
Performance of Levenshtein is explained on its wikipedia page.