有哪些字符串相似度算法?

发布于 2024-09-16 02:47:29 字数 552 浏览 12 评论 0原文

我需要比较两个字符串并计算它们的相似度,以过滤出最相似字符串的列表。

搜索“dog”

  1. dogdognebogfoggy
  2. 返回
  3. 例如

例如 搜索“crack”会返回

  1. crack
  2. smartcrackrack
  3. jack
  4. :< a
  5. quack

我遇到过的

  • href="http://orderedlist.com/our-writing/blog/articles/live-search-with-quicksilver-style/" rel ="nofollow noreferrer">QuickSilver
  • LiquidMetal

还有哪些其他字符串相似度算法?

I need to compare 2 strings and calculate their similarity, to filter down a list of the most similar strings.

e.g. searching for "dog" would return

  1. dog
  2. doggone
  3. bog
  4. fog
  5. foggy

e.g. searching for "crack" would return

  1. crack
  2. wisecrack
  3. rack
  4. jack
  5. quack

I have come across:

What other string similarity algorithms are there?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

何止钟意 2024-09-23 02:47:44

我将 Levenshtein 距离与 soundex 索引结合使用,尝试对拼写错误的药物名称进行模糊匹配,这些名称总是相当冗长和尴尬。 Soundex 有点以英语为中心,因此您需要其他语言的东西。

I have used Levenshtein distance together with a soundex index, to try and come up with a fuzzy match to misspelled pharmaceutical names that are always quite lengthy and awkward. Soundex is a bit English centric so you would need something else for other languages.

○闲身 2024-09-23 02:47:43
class Program { 
    static int ComputeLevenshteinDistance(string source, string target) {
        if ((source == null) || (target == null)) return 0;
        if ((source.Length == 0) || (target.Length == 0)) return 0;
        if (source == target) return source.Length;

        int sourceWordCount = source.Length;
        int targetWordCount = target.Length;

        int[,] distance = new int[sourceWordCount + 1, targetWordCount + 1];

        // Step 2
        for (int i = 0; i <= sourceWordCount; distance[i, 0] = i++);
        for (int j = 0; j <= targetWordCount; distance[0, j] = j++);

        for (int i = 1; i <= sourceWordCount; i++) {
            for (int j = 1; j <= targetWordCount; j++) {
                // Step 3
                int cost = (target[j - 1] == source[i - 1]) ? 0 : 1;

                // Step 4
                distance[i, j] = Math.Min(Math.Min(distance[i - 1, j] + 1, distance[i, j - 1] + 1), distance[i - 1, j - 1] + cost);
            }
        }

        return distance[sourceWordCount, targetWordCount]; 
    }

    static void Main(string[] args){ 
       Console.WriteLine(ComputeLevenshteinDistance ("Stackoverflow","StuckOverflow"));
       Console.ReadKey();
    }
}
class Program { 
    static int ComputeLevenshteinDistance(string source, string target) {
        if ((source == null) || (target == null)) return 0;
        if ((source.Length == 0) || (target.Length == 0)) return 0;
        if (source == target) return source.Length;

        int sourceWordCount = source.Length;
        int targetWordCount = target.Length;

        int[,] distance = new int[sourceWordCount + 1, targetWordCount + 1];

        // Step 2
        for (int i = 0; i <= sourceWordCount; distance[i, 0] = i++);
        for (int j = 0; j <= targetWordCount; distance[0, j] = j++);

        for (int i = 1; i <= sourceWordCount; i++) {
            for (int j = 1; j <= targetWordCount; j++) {
                // Step 3
                int cost = (target[j - 1] == source[i - 1]) ? 0 : 1;

                // Step 4
                distance[i, j] = Math.Min(Math.Min(distance[i - 1, j] + 1, distance[i, j - 1] + 1), distance[i - 1, j - 1] + cost);
            }
        }

        return distance[sourceWordCount, targetWordCount]; 
    }

    static void Main(string[] args){ 
       Console.WriteLine(ComputeLevenshteinDistance ("Stackoverflow","StuckOverflow"));
       Console.ReadKey();
    }
}
温暖的光 2024-09-23 02:47:42

您可以这样做:

Foreach string in haystack Do
    offset := -1;
    matchedCharacters := 0;
    Foreach char in needle Do
        offset := PositionInString(string, char, offset+1);
        If offset = -1 Then
            Break;
        End;
        matchedCharacters := matchedCharacters + 1;
    End;
    If matchedCharacters > 0 Then
       // (partial) match found
    End;
End;

使用 matchedCharacters 您可以确定匹配的“程度”。如果它等于needle的长度,则needle中的所有字符也在string中。如果还存储了第一个匹配字符的偏移量,那么还可以通过用最后一个匹配字符的偏移量减去第一个匹配字符的偏移量来按匹配字符的“密度”对结果进行排序 offset;差异越小,匹配越密集。

You could do this:

Foreach string in haystack Do
    offset := -1;
    matchedCharacters := 0;
    Foreach char in needle Do
        offset := PositionInString(string, char, offset+1);
        If offset = -1 Then
            Break;
        End;
        matchedCharacters := matchedCharacters + 1;
    End;
    If matchedCharacters > 0 Then
       // (partial) match found
    End;
End;

With matchedCharacters you can determine the “degree” of the match. If it is equal to the length of needle, all characters in needle are also in string. If you also store the offset of the first matched character, you can also sort the result by the “density” of the matched characters by subtracting the offset of the first matched character from the offset of the last matched character offset; the lower the difference, the more dense the match.

小红帽 2024-09-23 02:47:40

如果重点是性能,我会实现一个基于 trie 结构
(非常适合在文本中查找单词,或帮助纠正单词,但例如,在您的情况下,您可以快速找到包含给定单词的所有单词或除一个字母外的所有单词)。

请首先关注上面的维基百科链接。Tries 是最快的单词排序方法(n 个单词,搜索 s,O(n< /em>) 创建 trie,O(1) 搜索 s (或者如果您愿意,如果 a 是平均长度,则 O(an< /em>) 用于 trie 和 O(s) 用于搜索))。

(相似词)的快速且简单的实现(待优化)包括:

  • 问题 em>trie 包含单词列表,所有字母都在前面和后面编入索引(请参见下面的示例)
  • 要搜索 s,请从 s[0] 迭代到在 trie 中查找单词,然后 s[1] 等...
  • 在 trie 中,如果找到的字母数为 len(s)-k 显示单词,其中 k 是容差(缺少 1 个字母,2...)。
  • 该算法可以扩展到列表中的单词(见下文)

例如,单词carvars

构建特里树(大字母意味着一个单词在这里结束,而另一个单词可能继续)。 > 是后索引(前进),< 是前索引(后退)。在另一个例子中,我们可能还必须指示起始字母,为了清楚起见,这里没有显示它。
例如,C++ 中的 <>Mystruct *previous,*next,意思是 a > c< r,您可以直接从 a 转到 c,反之,也可以从 a 转到 R >。

  1.  c < a < R
  2.  a > c < R
  3.    > v < r < S
  4.  R > a > c
  5.        > v < S
  6.  v < a < r < S
  7.  S > r > a > v

严格查找car,特里树为您提供从 1. 开始的访问权限,并且您找到 car(您还会找到以 car 开头的所有内容,但是还有任何包含汽车的东西 - 它不在示例中 - 但例如可以从 c > i > v < a < R 中找到。

要在允许 1 个字母错误/缺失容差的情况下进行搜索,您可以从 s 的每个字母进行迭代,并计算从 s< 获得的连续字母(或跳过 1 个字母)的数量/em> 在特里树中。

寻找 car,

  • c:在 trie 中搜索 c c ac < rs 中缺少字母)。要接受单词中的错误字母 w,请尝试在每次迭代时跳转到错误的字母,看看 ar 是否在后面,这是 O(w em>)。使用两个字母,O(w²) 等...但是可以将另一级索引添加到特里树中,以考虑字母上的跳转 - 制作特里树复杂,并且对记忆贪婪。
  • a,然后 r:与上面相同,但也向后搜索

这只是为了提供有关原理的想法 - 上面的示例可能有一些故障(我将明天再检查一下)。

If the focus is on performance, I would implement an algorithm based on a trie structure
(works well to find words in a text, or to help correct a word, but in your case you can find quickly all words containing a given word or all but one letter, for instance).

Please follow first the wikipedia link above.Tries is the fastest words sorting method (n words, search s, O(n) to create the trie, O(1) to search s (or if you prefer, if a is the average length, O(an) for the trie and O(s) for the search)).

A fast and easy implementation (to be optimized) of your problem (similar words) consists of

  • Make the trie with the list of words, having all letters indexed front and back (see example below)
  • To search s, iterate from s[0] to find the word in the trie, then s[1] etc...
  • In the trie, if the number of letters found is len(s)-k the word is displayed, where k is the tolerance (1 letter missing, 2...).
  • The algorithm may be extended to the words in the list (see below)

Example, with the words car, vars.

Building the trie (big letter means a word end here, while another may continue). The > is post-index (go forward) and < is pre-index (go backward). In another example we may have to indicate also the starting letter, it is not presented here for clarity.
The < and > in C++ for instance would be Mystruct *previous,*next, meaning from a > c < r, you can go directly from a to c, and reversely, also from a to R.

  1.  c < a < R
  2.  a > c < R
  3.    > v < r < S
  4.  R > a > c
  5.        > v < S
  6.  v < a < r < S
  7.  S > r > a > v

Looking strictly for car the trie gives you access from 1., and you find car (you would have found also everything starting with car, but also anything with car inside - it is not in the example - but vicar for instance would have been found from c > i > v < a < R).

To search while allowing 1-letter wrong/missing tolerance, you iterate from each letter of s, and, count the number of consecutive - or by skipping 1 letter - letters you get from s in the trie.

looking for car,

  • c: searching the trie for c < a and c < r (missing letter in s). To accept a wrong letter in a word w, try to jump at each iteration the wrong letter to see if ar is behind, this is O(w). With two letters, O(w²) etc... but another level of index could be added to the trie to take into account the jump over letters - making the trie complex, and greedy regarding memory.
  • a, then r: same as above, but searching backwards as well

This is just to provide an idea about the principle - the example above may have some glitches (I'll check again tomorrow).

海拔太高太耀眼 2024-09-23 02:47:38

看来您需要某种模糊匹配。这是一些相似性度量集的java实现 http://www.dcs.shef.ac.uk/~sam/stringmetrics.html。以下是字符串指标的更详细说明 http://www. cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf 这取决于您的实施必须有多模糊和多快。

It seems you are needing some kind of fuzzy matching. Here is java implementation of some set of similarity metrics http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. Here is more detailed explanation of string metrics http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf it depends on how fuzzy and how fast your implementation must be.

才能让你更想念 2024-09-23 02:47:37

Levenshtein 距离是我推荐的算法。它计算将一个字符串更改为另一个字符串所需执行的最少操作次数。变化越少意味着字符串越相似......

The Levenshtein distance is the algorithm I would recommend. It calculates the minimum number of operations you must do to change 1 string into another. The fewer changes means the strings are more similar...

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文