有哪些字符串相似度算法?
我需要比较两个字符串并计算它们的相似度,以过滤出最相似字符串的列表。
搜索“dog”
- dogdognebogfoggy
- 返回
- ,
- 例如
- 将
例如 搜索“crack”会返回
- crack
- smartcrackrack
- jack
- :< a
- quack
我遇到过的
- href="http://orderedlist.com/our-writing/blog/articles/live-search-with-quicksilver-style/" rel ="nofollow noreferrer">QuickSilver
- LiquidMetal
还有哪些其他字符串相似度算法?
I need to compare 2 strings and calculate their similarity, to filter down a list of the most similar strings.
e.g. searching for "dog" would return
- dog
- doggone
- bog
- fog
- foggy
e.g. searching for "crack" would return
- crack
- wisecrack
- rack
- jack
- quack
I have come across:
What other string similarity algorithms are there?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
我将 Levenshtein 距离与 soundex 索引结合使用,尝试对拼写错误的药物名称进行模糊匹配,这些名称总是相当冗长和尴尬。 Soundex 有点以英语为中心,因此您需要其他语言的东西。
I have used Levenshtein distance together with a soundex index, to try and come up with a fuzzy match to misspelled pharmaceutical names that are always quite lengthy and awkward. Soundex is a bit English centric so you would need something else for other languages.
您可以这样做:
使用 matchedCharacters 您可以确定匹配的“程度”。如果它等于needle的长度,则needle中的所有字符也在string中。如果还存储了第一个匹配字符的偏移量,那么还可以通过用最后一个匹配字符的偏移量减去第一个匹配字符的偏移量来按匹配字符的“密度”对结果进行排序 offset;差异越小,匹配越密集。
You could do this:
With matchedCharacters you can determine the “degree” of the match. If it is equal to the length of needle, all characters in needle are also in string. If you also store the offset of the first matched character, you can also sort the result by the “density” of the matched characters by subtracting the offset of the first matched character from the offset of the last matched character offset; the lower the difference, the more dense the match.
如果重点是性能,我会实现一个基于
trie
结构(非常适合在文本中查找单词,或帮助纠正单词,但例如,在您的情况下,您可以快速找到包含给定单词的所有单词或除一个字母外的所有单词)。
请首先关注上面的维基百科链接。
Tries
是最快的单词排序方法(n 个单词,搜索 s,O(n< /em>) 创建 trie,O(1) 搜索 s (或者如果您愿意,如果 a 是平均长度,则 O(an< /em>) 用于 trie 和 O(s) 用于搜索))。(相似词)的快速且简单的实现(待优化)包括:
例如,单词
car
,vars
。构建特里树(大字母意味着一个单词在这里结束,而另一个单词可能继续)。
>
是后索引(前进),<
是前索引(后退)。在另一个例子中,我们可能还必须指示起始字母,为了清楚起见,这里没有显示它。例如,C++ 中的
<
和>
为Mystruct *previous,*next
,意思是a > c< r
,您可以直接从a
转到c
,反之,也可以从a
转到R
>。严格查找car,特里树为您提供从 1. 开始的访问权限,并且您找到 car(您还会找到以 car 开头的所有内容,但是还有任何包含汽车的东西 - 它不在示例中 - 但例如可以从
c > i > v < a < R
中找到。要在允许 1 个字母错误/缺失容差的情况下进行搜索,您可以从 s 的每个字母进行迭代,并计算从 s< 获得的连续字母(或跳过 1 个字母)的数量/em> 在特里树中。
寻找
car
,c
:在 trie 中搜索c
c
a
和c < r
(s 中缺少字母)。要接受单词中的错误字母 w,请尝试在每次迭代时跳转到错误的字母,看看ar
是否在后面,这是 O(w em>)。使用两个字母,O(w²) 等...但是可以将另一级索引添加到特里树中,以考虑字母上的跳转 - 制作特里树复杂,并且对记忆贪婪。a
,然后r
:与上面相同,但也向后搜索这只是为了提供有关原理的想法 - 上面的示例可能有一些故障(我将明天再检查一下)。
If the focus is on performance, I would implement an algorithm based on a
trie
structure(works well to find words in a text, or to help correct a word, but in your case you can find quickly all words containing a given word or all but one letter, for instance).
Please follow first the wikipedia link above.
Tries
is the fastest words sorting method (n words, search s, O(n) to create the trie, O(1) to search s (or if you prefer, if a is the average length, O(an) for the trie and O(s) for the search)).A fast and easy implementation (to be optimized) of your problem (similar words) consists of
Example, with the words
car
,vars
.Building the trie (big letter means a word end here, while another may continue). The
>
is post-index (go forward) and<
is pre-index (go backward). In another example we may have to indicate also the starting letter, it is not presented here for clarity.The
<
and>
in C++ for instance would beMystruct *previous,*next
, meaning froma > c < r
, you can go directly froma
toc
, and reversely, also froma
toR
.Looking strictly for car the trie gives you access from 1., and you find car (you would have found also everything starting with car, but also anything with car inside - it is not in the example - but vicar for instance would have been found from
c > i > v < a < R
).To search while allowing 1-letter wrong/missing tolerance, you iterate from each letter of s, and, count the number of consecutive - or by skipping 1 letter - letters you get from s in the trie.
looking for
car
,c
: searching the trie forc < a
andc < r
(missing letter in s). To accept a wrong letter in a word w, try to jump at each iteration the wrong letter to see ifar
is behind, this is O(w). With two letters, O(w²) etc... but another level of index could be added to the trie to take into account the jump over letters - making the trie complex, and greedy regarding memory.a
, thenr
: same as above, but searching backwards as wellThis is just to provide an idea about the principle - the example above may have some glitches (I'll check again tomorrow).
看来您需要某种模糊匹配。这是一些相似性度量集的java实现 http://www.dcs.shef.ac.uk/~sam/stringmetrics.html。以下是字符串指标的更详细说明 http://www. cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf 这取决于您的实施必须有多模糊和多快。
It seems you are needing some kind of fuzzy matching. Here is java implementation of some set of similarity metrics http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. Here is more detailed explanation of string metrics http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf it depends on how fuzzy and how fast your implementation must be.
Levenshtein 距离是我推荐的算法。它计算将一个字符串更改为另一个字符串所需执行的最少操作次数。变化越少意味着字符串越相似......
The Levenshtein distance is the algorithm I would recommend. It calculates the minimum number of operations you must do to change 1 string into another. The fewer changes means the strings are more similar...