如何衡量两个字符串之间的相似度?

发布于 2024-07-25 11:30:00 字数 1438 浏览 8 评论 0原文

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(12

囚你心 2024-08-01 11:30:00

有多种不同的方法可以做到这一点。 查看维基百科“字符串相似性度量”页面,获取其他包含算法的页面的链接。

然而,我不认为这些算法中的任何一个都会考虑声音 - 因此“staq 溢出”将与“堆栈溢出”和“staw 溢出”类似,尽管第一个在术语上更相似的发音。

我刚刚找到另一个页面,它提供了更多选项。特别是 Soundex 算法 (Wikipedia)可能更接近您所追求的内容。

There are various different ways of doing this. Have a look at the Wikipedia "String similarity measures" page for links to other pages with algorithms.

I don't think any of those algorithms take sounds into consideration, however - so "staq overflow" would be as similar to "stack overflow" as "staw overflow" despite the first being more similar in terms of pronunciation.

I've just found another page which gives rather more options... in particular, the Soundex algorithm (Wikipedia) may be closer to what you're after.

私藏温柔 2024-08-01 11:30:00

Levenshtein 距离 可能就是您正在寻找的。

Levenshtein distance is probably what you're looking for.

许你一世情深 2024-08-01 11:30:00

这是我为我正在从事的项目编写的一些代码。 我需要知道字符串的相似率和基于字符串单词的相似率。
最后一个,我想知道最小字符串的单词相似度(因此,如果所有单词都存在并在较大字符串中匹配,则结果将为 100%)和较大字符串的单词相似度(我称之为 RealWordsRatio) )。
我使用 Levenshtein 算法来计算距离。 到目前为止,代码尚未优化,但它按预期工作。 希望对你有帮助。

public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }

double GetSimilarityRatio(String FullString1, String FullString2, out double WordsRatio, out double RealWordsRatio)
    {
        double theResult = 0;
        String[] Splitted1 = FullString1.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
        String[] Splitted2 = FullString2.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
        if (Splitted1.Length < Splitted2.Length)
        {
            String[] Temp = Splitted2;
            Splitted2 = Splitted1;
            Splitted1 = Temp;
        }
        int[,] theScores = new int[Splitted1.Length, Splitted2.Length];//Keep the best scores for each word.0 is the best, 1000 is the starting.
        int[] BestWord = new int[Splitted1.Length];//Index to the best word of Splitted2 for the Splitted1.

        for (int loop = 0; loop < Splitted1.Length; loop++) 
        {
            for (int loop1 = 0; loop1 < Splitted2.Length; loop1++) theScores[loop, loop1] = 1000;
            BestWord[loop] = -1;
        }
        int WordsMatched = 0;
        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            String String1 = Splitted1[loop];
            for (int loop1 = 0; loop1 < Splitted2.Length; loop1++)
            {
                String String2 = Splitted2[loop1];
                int LevenshteinDistance = Compute(String1, String2);
                theScores[loop, loop1] = LevenshteinDistance;
                if (BestWord[loop] == -1 || theScores[loop, BestWord[loop]] > LevenshteinDistance) BestWord[loop] = loop1;
            }
        }

        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            if (theScores[loop, BestWord[loop]] == 1000) continue;
            for (int loop1 = loop + 1; loop1 < Splitted1.Length; loop1++)
            {
                if (theScores[loop1, BestWord[loop1]] == 1000) continue;//the worst score available, so there are no more words left
                if (BestWord[loop] == BestWord[loop1])//2 words have the same best word
                {
                    //The first in order has the advantage of keeping the word in equality
                    if (theScores[loop, BestWord[loop]] <= theScores[loop1, BestWord[loop1]])
                    {
                        theScores[loop1, BestWord[loop1]] = 1000;
                        int CurrentBest = -1;
                        int CurrentScore = 1000;
                        for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                        {
                            //Find next bestword
                            if (CurrentBest == -1 || CurrentScore > theScores[loop1, loop2])
                            {
                                CurrentBest = loop2;
                                CurrentScore = theScores[loop1, loop2];
                            }
                        }
                        BestWord[loop1] = CurrentBest;
                    }
                    else//the latter has a better score
                    {
                        theScores[loop, BestWord[loop]] = 1000;
                        int CurrentBest = -1;
                        int CurrentScore = 1000;
                        for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                        {
                            //Find next bestword
                            if (CurrentBest == -1 || CurrentScore > theScores[loop, loop2])
                            {
                                CurrentBest = loop2;
                                CurrentScore = theScores[loop, loop2];
                            }
                        }
                        BestWord[loop] = CurrentBest;
                    }

                    loop = -1;
                    break;//recalculate all
                }
            }
        }
        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            if (theScores[loop, BestWord[loop]] == 1000) theResult += Splitted1[loop].Length;//All words without a score for best word are max failures
            else
            {
                theResult += theScores[loop, BestWord[loop]];
                if (theScores[loop, BestWord[loop]] == 0) WordsMatched++;
            }
        }
        int theLength = (FullString1.Replace(" ", "").Length > FullString2.Replace(" ", "").Length) ? FullString1.Replace(" ", "").Length : FullString2.Replace(" ", "").Length;
        if(theResult > theLength) theResult = theLength;
        theResult = (1 - (theResult / theLength)) * 100;
        WordsRatio = ((double)WordsMatched / (double)Splitted2.Length) * 100;
        RealWordsRatio = ((double)WordsMatched / (double)Splitted1.Length) * 100;
        return theResult;
    }

Here is some code I have written for a project I am working on. I need to know the Similarity Ratio of the strings and the Similarity Ratio based on words of the strings.
This last one, I want to know both the Words Similarity Ratio of the smallest string(so if all words exist and match in the larger string the result will be 100%) and the Words Similarity Ratio of the larger string(which I call RealWordsRatio).
I use the Levenshtein algorithm to find the distance. The code is unoptimised, so far, but it works as expected. I hope you find it useful.

public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }

double GetSimilarityRatio(String FullString1, String FullString2, out double WordsRatio, out double RealWordsRatio)
    {
        double theResult = 0;
        String[] Splitted1 = FullString1.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
        String[] Splitted2 = FullString2.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
        if (Splitted1.Length < Splitted2.Length)
        {
            String[] Temp = Splitted2;
            Splitted2 = Splitted1;
            Splitted1 = Temp;
        }
        int[,] theScores = new int[Splitted1.Length, Splitted2.Length];//Keep the best scores for each word.0 is the best, 1000 is the starting.
        int[] BestWord = new int[Splitted1.Length];//Index to the best word of Splitted2 for the Splitted1.

        for (int loop = 0; loop < Splitted1.Length; loop++) 
        {
            for (int loop1 = 0; loop1 < Splitted2.Length; loop1++) theScores[loop, loop1] = 1000;
            BestWord[loop] = -1;
        }
        int WordsMatched = 0;
        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            String String1 = Splitted1[loop];
            for (int loop1 = 0; loop1 < Splitted2.Length; loop1++)
            {
                String String2 = Splitted2[loop1];
                int LevenshteinDistance = Compute(String1, String2);
                theScores[loop, loop1] = LevenshteinDistance;
                if (BestWord[loop] == -1 || theScores[loop, BestWord[loop]] > LevenshteinDistance) BestWord[loop] = loop1;
            }
        }

        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            if (theScores[loop, BestWord[loop]] == 1000) continue;
            for (int loop1 = loop + 1; loop1 < Splitted1.Length; loop1++)
            {
                if (theScores[loop1, BestWord[loop1]] == 1000) continue;//the worst score available, so there are no more words left
                if (BestWord[loop] == BestWord[loop1])//2 words have the same best word
                {
                    //The first in order has the advantage of keeping the word in equality
                    if (theScores[loop, BestWord[loop]] <= theScores[loop1, BestWord[loop1]])
                    {
                        theScores[loop1, BestWord[loop1]] = 1000;
                        int CurrentBest = -1;
                        int CurrentScore = 1000;
                        for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                        {
                            //Find next bestword
                            if (CurrentBest == -1 || CurrentScore > theScores[loop1, loop2])
                            {
                                CurrentBest = loop2;
                                CurrentScore = theScores[loop1, loop2];
                            }
                        }
                        BestWord[loop1] = CurrentBest;
                    }
                    else//the latter has a better score
                    {
                        theScores[loop, BestWord[loop]] = 1000;
                        int CurrentBest = -1;
                        int CurrentScore = 1000;
                        for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                        {
                            //Find next bestword
                            if (CurrentBest == -1 || CurrentScore > theScores[loop, loop2])
                            {
                                CurrentBest = loop2;
                                CurrentScore = theScores[loop, loop2];
                            }
                        }
                        BestWord[loop] = CurrentBest;
                    }

                    loop = -1;
                    break;//recalculate all
                }
            }
        }
        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            if (theScores[loop, BestWord[loop]] == 1000) theResult += Splitted1[loop].Length;//All words without a score for best word are max failures
            else
            {
                theResult += theScores[loop, BestWord[loop]];
                if (theScores[loop, BestWord[loop]] == 0) WordsMatched++;
            }
        }
        int theLength = (FullString1.Replace(" ", "").Length > FullString2.Replace(" ", "").Length) ? FullString1.Replace(" ", "").Length : FullString2.Replace(" ", "").Length;
        if(theResult > theLength) theResult = theLength;
        theResult = (1 - (theResult / theLength)) * 100;
        WordsRatio = ((double)WordsMatched / (double)Splitted2.Length) * 100;
        RealWordsRatio = ((double)WordsMatched / (double)Splitted1.Length) * 100;
        return theResult;
    }
回忆追雨的时光 2024-08-01 11:30:00

不久前,我用 C# 编写了一个 Double Metaphone 实现。 您会发现它远远优于 Soundex 等。

编辑距离也被提出,它是一个有很多用途的伟大算法,但语音匹配并不是它真正的作用; 只是有时看起来是这样,因为语音相似的单词通常拼写也相似。 我对各种模糊匹配算法进行了分析,您可能也会发现它很有用。

I wrote a Double Metaphone implementation in C# a while back. You'll find it vastly superior to Soundex and the like.

Levenshtein distance has also been suggested, and it's a great algorithm for a lot of uses, but phonetic matching is not really what it does; it only seems that way sometimes because phonetically similar words are also usually spelled similarly. I did an analysis of various fuzzy matching algorithms which you might also find useful.

只怪假的太真实 2024-08-01 11:30:00

为了处理“声音相似”的问题,您可能需要考虑使用语音算法(例如 Double Metaphone 或 soundex)进行编码。 我不知道在语音编码字符串上计算编辑距离是否有益,但可能是一种可能性。 或者,您可以使用启发式方法,例如:将字符串中的每个单词转换为其编码形式,并删除两个字符串中出现的任何单词,并在计算编辑距离之前将它们替换为单个表示形式。

To deal with 'sound alikes' you may want to look into encoding using a phonetic algorithm like Double Metaphone or soundex. I don't know if computing Levenshtein distances on phonetic encoded strings would be beneficial or not, but might be a possibility. Alternately, you could use a heuristic like: convert each word in the string to its encoded form and remove any words that occur in both strings and replace them with a single representation before computing the Levenshtein distance.

旧话新听 2024-08-01 11:30:00

您可能会查找字符串“距离”,例如编辑距离

You might look for string "distances", for example the Levenshtein distance.

鸵鸟症 2024-08-01 11:30:00

Perl 模块 Text::Phonetic 具有各种算法的实现。

Perl module Text::Phonetic has implementations of various algorithms.

爱已欠费 2024-08-01 11:30:00

Jeff Atwood 写过关于寻找类似解决方案来确定 wiki 帖子的作者身份的文章这可能会帮助您缩小搜索范围。

Jeff Atwood wrote about looking for a similar solution for determining the authorship of wiki posts which may help you narrow your search.

蓝戈者 2024-08-01 11:30:00

如果您要比较 SQL 数据库中的值,则可以使用 SOUNDEX 函数。 如果您在 Google 上查询 SOUNDEX 和 C#,有些人已经为此和 VB 编写了类似的函数。

If you're comparing values in a SQL database you can use the SOUNDEX function. If you query Google for SOUNDEX and C#, some people have written a similar function for that and VB.

无敌元气妹 2024-08-01 11:30:00

我也必须推荐 Soundex,我过去曾用它来处理拼写错误的城市名称。 这是一个很好的使用链接:http://whitepapers.zdnet.com/abstract。 aspx?docid=352953

I have to recommend Soundex too, I have used it in the past to process misspelt city names. Here is a good link for usage: http://whitepapers.zdnet.com/abstract.aspx?docid=352953

游魂 2024-08-01 11:30:00

如果您想进行语音比较,请查看 Soundex 和 Metaphone 算法: http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex

If you want to compare phonetically, check out the Soundex and Metaphone algorithms: http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex

无可置疑 2024-08-01 11:30:00

Metaphone 3 是第三代 Metaphone 算法。 它
将拼音编码的准确率提高了 Double 的 89%
变音位 98%,针对最常见的数据库进行测试
北方人熟悉的英语单词、名字和非英语单词
美国。 这产生了极其可靠的语音编码
美式发音。

Metaphone 3 由 Lawrence Philips 设计和开发,他
设计并开发了原始 Metaphone 和 Double Metaphone
算法。

Metaphone 3 is the third generation of the Metaphone algorithm. It
increases the accuracy of phonetic encoding from the 89% of Double
Metaphone to 98%, as tested against a database of the most common
English words, and names and non-English words familiar in North
America. This produces an extremely reliable phonetic encoding for
American pronunciations.

Metaphone 3 was designed and developed by Lawrence Philips, who
designed and developed the original Metaphone and Double Metaphone
algorithms.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文