当前位置：文江博客话题详情

如何衡量两个字符串之间的相似度？

发布于 2024-07-25 11:30:00 字数 1438 浏览 19 评论 0原文

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

囚你心 2024-08-01 11:30:00

有多种不同的方法可以做到这一点。查看维基百科“字符串相似性度量”页面，获取其他包含算法的页面的链接。

然而，我不认为这些算法中的任何一个都会考虑声音 - 因此“staq 溢出”将与“堆栈溢出”和“staw 溢出”类似，尽管第一个在术语上更相似的发音。

我刚刚找到另一个页面，它提供了更多选项。特别是 Soundex 算法 (Wikipedia）可能更接近您所追求的内容。

回复收藏 0 原文

私藏温柔 2024-08-01 11:30:00

Levenshtein 距离可能就是您正在寻找的。

回复收藏 0 原文

许你一世情深 2024-08-01 11:30:00

这是我为我正在从事的项目编写的一些代码。我需要知道字符串的相似率和基于字符串单词的相似率。
最后一个，我想知道最小字符串的单词相似度（因此，如果所有单词都存在并在较大字符串中匹配，则结果将为 100%）和较大字符串的单词相似度（我称之为 RealWordsRatio））。
我使用 Levenshtein 算法来计算距离。到目前为止，代码尚未优化，但它按预期工作。希望对你有帮助。

public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }

double GetSimilarityRatio(String FullString1, String FullString2, out double WordsRatio, out double RealWordsRatio)
    {
        double theResult = 0;
        String[] Splitted1 = FullString1.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
        String[] Splitted2 = FullString2.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
        if (Splitted1.Length < Splitted2.Length)
        {
            String[] Temp = Splitted2;
            Splitted2 = Splitted1;
            Splitted1 = Temp;
        }
        int[,] theScores = new int[Splitted1.Length, Splitted2.Length];//Keep the best scores for each word.0 is the best, 1000 is the starting.
        int[] BestWord = new int[Splitted1.Length];//Index to the best word of Splitted2 for the Splitted1.

        for (int loop = 0; loop < Splitted1.Length; loop++) 
        {
            for (int loop1 = 0; loop1 < Splitted2.Length; loop1++) theScores[loop, loop1] = 1000;
            BestWord[loop] = -1;
        }
        int WordsMatched = 0;
        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            String String1 = Splitted1[loop];
            for (int loop1 = 0; loop1 < Splitted2.Length; loop1++)
            {
                String String2 = Splitted2[loop1];
                int LevenshteinDistance = Compute(String1, String2);
                theScores[loop, loop1] = LevenshteinDistance;
                if (BestWord[loop] == -1 || theScores[loop, BestWord[loop]] > LevenshteinDistance) BestWord[loop] = loop1;
            }
        }

        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            if (theScores[loop, BestWord[loop]] == 1000) continue;
            for (int loop1 = loop + 1; loop1 < Splitted1.Length; loop1++)
            {
                if (theScores[loop1, BestWord[loop1]] == 1000) continue;//the worst score available, so there are no more words left
                if (BestWord[loop] == BestWord[loop1])//2 words have the same best word
                {
                    //The first in order has the advantage of keeping the word in equality
                    if (theScores[loop, BestWord[loop]] <= theScores[loop1, BestWord[loop1]])
                    {
                        theScores[loop1, BestWord[loop1]] = 1000;
                        int CurrentBest = -1;
                        int CurrentScore = 1000;
                        for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                        {
                            //Find next bestword
                            if (CurrentBest == -1 || CurrentScore > theScores[loop1, loop2])
                            {
                                CurrentBest = loop2;
                                CurrentScore = theScores[loop1, loop2];
                            }
                        }
                        BestWord[loop1] = CurrentBest;
                    }
                    else//the latter has a better score
                    {
                        theScores[loop, BestWord[loop]] = 1000;
                        int CurrentBest = -1;
                        int CurrentScore = 1000;
                        for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                        {
                            //Find next bestword
                            if (CurrentBest == -1 || CurrentScore > theScores[loop, loop2])
                            {
                                CurrentBest = loop2;
                                CurrentScore = theScores[loop, loop2];
                            }
                        }
                        BestWord[loop] = CurrentBest;
                    }

                    loop = -1;
                    break;//recalculate all
                }
            }
        }
        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            if (theScores[loop, BestWord[loop]] == 1000) theResult += Splitted1[loop].Length;//All words without a score for best word are max failures
            else
            {
                theResult += theScores[loop, BestWord[loop]];
                if (theScores[loop, BestWord[loop]] == 0) WordsMatched++;
            }
        }
        int theLength = (FullString1.Replace(" ", "").Length > FullString2.Replace(" ", "").Length) ? FullString1.Replace(" ", "").Length : FullString2.Replace(" ", "").Length;
        if(theResult > theLength) theResult = theLength;
        theResult = (1 - (theResult / theLength)) * 100;
        WordsRatio = ((double)WordsMatched / (double)Splitted2.Length) * 100;
        RealWordsRatio = ((double)WordsMatched / (double)Splitted1.Length) * 100;
        return theResult;
    }

Here is some code I have written for a project I am working on. I need to know the Similarity Ratio of the strings and the Similarity Ratio based on words of the strings.
This last one, I want to know both the Words Similarity Ratio of the smallest string(so if all words exist and match in the larger string the result will be 100%) and the Words Similarity Ratio of the larger string(which I call RealWordsRatio).
I use the Levenshtein algorithm to find the distance. The code is unoptimised, so far, but it works as expected. I hope you find it useful.

public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }

double GetSimilarityRatio(String FullString1, String FullString2, out double WordsRatio, out double RealWordsRatio)
    {
        double theResult = 0;
        String[] Splitted1 = FullString1.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
        String[] Splitted2 = FullString2.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
        if (Splitted1.Length < Splitted2.Length)
        {
            String[] Temp = Splitted2;
            Splitted2 = Splitted1;
            Splitted1 = Temp;
        }
        int[,] theScores = new int[Splitted1.Length, Splitted2.Length];//Keep the best scores for each word.0 is the best, 1000 is the starting.
        int[] BestWord = new int[Splitted1.Length];//Index to the best word of Splitted2 for the Splitted1.

        for (int loop = 0; loop < Splitted1.Length; loop++) 
        {
            for (int loop1 = 0; loop1 < Splitted2.Length; loop1++) theScores[loop, loop1] = 1000;
            BestWord[loop] = -1;
        }
        int WordsMatched = 0;
        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            String String1 = Splitted1[loop];
            for (int loop1 = 0; loop1 < Splitted2.Length; loop1++)
            {
                String String2 = Splitted2[loop1];
                int LevenshteinDistance = Compute(String1, String2);
                theScores[loop, loop1] = LevenshteinDistance;
                if (BestWord[loop] == -1 || theScores[loop, BestWord[loop]] > LevenshteinDistance) BestWord[loop] = loop1;
            }
        }

        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            if (theScores[loop, BestWord[loop]] == 1000) continue;
            for (int loop1 = loop + 1; loop1 < Splitted1.Length; loop1++)
            {
                if (theScores[loop1, BestWord[loop1]] == 1000) continue;//the worst score available, so there are no more words left
                if (BestWord[loop] == BestWord[loop1])//2 words have the same best word
                {
                    //The first in order has the advantage of keeping the word in equality
                    if (theScores[loop, BestWord[loop]] <= theScores[loop1, BestWord[loop1]])
                    {
                        theScores[loop1, BestWord[loop1]] = 1000;
                        int CurrentBest = -1;
                        int CurrentScore = 1000;
                        for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                        {
                            //Find next bestword
                            if (CurrentBest == -1 || CurrentScore > theScores[loop1, loop2])
                            {
                                CurrentBest = loop2;
                                CurrentScore = theScores[loop1, loop2];
                            }
                        }
                        BestWord[loop1] = CurrentBest;
                    }
                    else//the latter has a better score
                    {
                        theScores[loop, BestWord[loop]] = 1000;
                        int CurrentBest = -1;
                        int CurrentScore = 1000;
                        for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                        {
                            //Find next bestword
                            if (CurrentBest == -1 || CurrentScore > theScores[loop, loop2])
                            {
                                CurrentBest = loop2;
                                CurrentScore = theScores[loop, loop2];
                            }
                        }
                        BestWord[loop] = CurrentBest;
                    }

                    loop = -1;
                    break;//recalculate all
                }
            }
        }
        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            if (theScores[loop, BestWord[loop]] == 1000) theResult += Splitted1[loop].Length;//All words without a score for best word are max failures
            else
            {
                theResult += theScores[loop, BestWord[loop]];
                if (theScores[loop, BestWord[loop]] == 0) WordsMatched++;
            }
        }
        int theLength = (FullString1.Replace(" ", "").Length > FullString2.Replace(" ", "").Length) ? FullString1.Replace(" ", "").Length : FullString2.Replace(" ", "").Length;
        if(theResult > theLength) theResult = theLength;
        theResult = (1 - (theResult / theLength)) * 100;
        WordsRatio = ((double)WordsMatched / (double)Splitted2.Length) * 100;
        RealWordsRatio = ((double)WordsMatched / (double)Splitted1.Length) * 100;
        return theResult;
    }

回复收藏 0 原文

回忆追雨的时光 2024-08-01 11:30:00

不久前，我用 C# 编写了一个 Double Metaphone 实现。您会发现它远远优于 Soundex 等。

编辑距离也被提出，它是一个有很多用途的伟大算法，但语音匹配并不是它真正的作用；只是有时看起来是这样，因为语音相似的单词通常拼写也相似。我对各种模糊匹配算法进行了分析，您可能也会发现它很有用。

回复收藏 0 原文

只怪假的太真实 2024-08-01 11:30:00

为了处理“声音相似”的问题，您可能需要考虑使用语音算法（例如 Double Metaphone 或 soundex）进行编码。我不知道在语音编码字符串上计算编辑距离是否有益，但可能是一种可能性。或者，您可以使用启发式方法，例如：将字符串中的每个单词转换为其编码形式，并删除两个字符串中出现的任何单词，并在计算编辑距离之前将它们替换为单个表示形式。

回复收藏 0 原文

旧话新听 2024-08-01 11:30:00

您可能会查找字符串“距离”，例如编辑距离。

回复收藏 0 原文

鸵鸟症 2024-08-01 11:30:00

Perl 模块 Text::Phonetic 具有各种算法的实现。

回复收藏 0 原文

爱已欠费 2024-08-01 11:30:00

Jeff Atwood 写过关于寻找类似解决方案来确定 wiki 帖子的作者身份的文章这可能会帮助您缩小搜索范围。

回复收藏 0 原文

蓝戈者 2024-08-01 11:30:00

如果您要比较 SQL 数据库中的值，则可以使用 SOUNDEX 函数。如果您在 Google 上查询 SOUNDEX 和 C#，有些人已经为此和 VB 编写了类似的函数。

回复收藏 0 原文

无敌元气妹 2024-08-01 11:30:00

我也必须推荐 Soundex，我过去曾用它来处理拼写错误的城市名称。这是一个很好的使用链接：http://whitepapers.zdnet.com/abstract。 aspx?docid=352953

回复收藏 0 原文

游魂 2024-08-01 11:30:00

如果您想进行语音比较，请查看 Soundex 和 Metaphone 算法： http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex

回复收藏 0 原文

无可置疑 2024-08-01 11:30:00

Metaphone 3 是第三代 Metaphone 算法。它
将拼音编码的准确率提高了 Double 的 89%
变音位 98%，针对最常见的数据库进行测试
北方人熟悉的英语单词、名字和非英语单词
美国。这产生了极其可靠的语音编码
美式发音。
Metaphone 3 由 Lawrence Philips 设计和开发，他
设计并开发了原始 Metaphone 和 Double Metaphone
算法。

回复收藏 0 原文

~没有更多了~

关于作者

黑寡妇

暂无简介

文章

27 人气

关注发私信

友情链接

文江博客

如何衡量两个字符串之间的相似度？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（12）

关于作者

相关话题

热门标签

推荐作者

佚名

今天

゛时过境迁

达拉崩吧

呆萌少年

孤者何惧

友情链接

如何衡量两个字符串之间的相似度？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（12）

关于作者

相关话题

热门标签

推荐作者

佚名

今天

゛时过境迁

达拉崩吧

呆萌少年

孤者何惧

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。