比较字符串与容差

发布于 2024-08-23 02:55:32 字数 124 浏览 9 评论 0原文

我正在寻找一种将字符串与字符串数组进行比较的方法。当然,进行精确搜索非常容易,但我希望我的程序能够容忍拼写错误、丢失部分字符串等。

是否有某种框架可以执行此类搜索?我想到搜索算法将按匹配百分比或类似的内容返回一些结果。

I'm looking for a way to compare a string with an array of strings. Doing an exact search is quite easy of course, but I want my program to tolerate spelling mistakes, missing parts of the string and so on.

Is there some kind of framework which can perform such a search? I'm having something in mind that the search algorithm will return a few results order by the percentage of match or something like this.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

清泪尽 2024-08-30 02:55:32

您可以使用Levenshtein 距离算法

“两个字符串之间的编辑距离定义为将一个字符串转换为另一个字符串所需的最小编辑次数,允许的编辑操作是插入、删除或替换单个字符。” - Wikipedia.com

这个来自 dotnetperls.com

using System;

/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
        Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
        Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
    }
}

事实上,您可能更喜欢使用 Damerau-Levenshtein 距离算法,该算法还允许字符转置,这是一种常见的人为错误数据输入。您可以在此处找到它的 C# 实现。

You could use the Levenshtein Distance algorithm.

"The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character." - Wikipedia.com

This one is from dotnetperls.com:

using System;

/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
        Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
        Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
    }
}

You may in fact prefer to use the Damerau-Levenshtein distance algorithm, which also allows characters to be transposed, which is a common human error in data entry. You'll find a C# implementation of it here.

箹锭⒈辈孓 2024-08-30 02:55:32

.NET 框架中没有任何东西可以帮助您完成此开箱即用的操作。

最常见的拼写错误是字母是单词的正确语音表示,但不是单词的正确拼写。

例如,可以认为单词 swordsord(是的,这是一个单词)具有相同的音根(当您发音时,它们听起来相同)。

话虽如此,您可以使用多种算法将单词(甚至拼写错误的单词)翻译为语音变体。

第一个是 Soundex。它的实现相当简单,并且有相当多的该算法的 .NET 实现。它相当简单,但它为您提供了可以相互比较的真实价值。

另一个是Metaphone。虽然我找不到 Metaphone 的本机 .NET 实现,但提供的链接包含指向许多可以转换的其他实现的链接。最容易转换的可能是Metaphone 算法的 Java 实现

值得注意的是,Metaphone 算法已经过修订。有 Double Metaphone (它有一个 .NET 实现) 和 Metaphone 3。 Metaphone 3 是一个商业应用程序,但在针对常见英语单词的数据库运行时,其准确率高达 98%,而 Double Metaphone 算法的准确率仅为 89%。根据您的需要,您可能需要寻找(在 Double Metaphone 的情况下)或购买(在 Metaphone 3 的情况下)算法源,并通过 P/Invoke 层转换或访问它(有 C++ 实现)盛产)。

Metaphone 和 Soundex 的不同之处在于 Soundex 生成固定长度的数字键,而 Metaphone 生成不同长度的键,因此结果会不同。最后,两者都会为您进行相同的比较,您只需根据您的要求和资源(当然还有对拼写错误的不容忍程度)找出最适合您的需求的比较。

There is nothing in the .NET framework that will help you with this out-of-the-box.

The most common spelling mistakes are those where the letters are a decent phonetic representation of the word, but not the correct spelling of the word.

For example, it could be argued that the words sword and sord (yes, that's a word) have the same phonetic roots (they sound the same when you pronounce them).

That being said, there are a number of algorithms that you can use to translate words (even mispelled ones) into phonetic variants.

The first is Soundex. It's fairly simple to implement and there are a fair number of .NET implementations of this algorithm. It's rather simple, but it gives you real values you can compare to each other.

Another is Metaphone. While I can't find a native .NET implementation of Metaphone, the link provided has links to a number of other implementations which could be converted. The easiest to convert would probably be the Java implementation of the Metaphone algorithm.

It should be noted that the Metaphone algorithm has gone through revisions. There is Double Metaphone (which has a .NET implementation) and Metaphone 3. Metaphone 3 is a commercial application, but has a 98% accuracy rate compared to an 89% accuracy rate for the Double Metaphone algorithm when run against a database of common English words. Depending on your need, you might want to look for (in the case of Double Metaphone) or purchase (in the case of Metaphone 3) the source for the algorithm and convert or access it through the P/Invoke layer (there are C++ implementations abound).

Metaphone and Soundex differ in the sense that Soundex produces fixed length numeric keys, whereas Metaphone produces keys of different length, so the results will be different. In the end, both will do the same kind of comparison for you, you just have to find out which suits your needs the best, given your requirements and resources (and intolerance levels for the spelling mistakes, of course).

紫轩蝶泪 2024-08-30 02:55:32

下面是 LevenshteinDistance 方法的实现,它在产生相同结果的同时使用更少的内存。这是伪代码的 C# 改编版,可在这篇 wikipedia 文章 的“迭代两个矩阵”下找到行”标题。

public static int LevenshteinDistance(string source, string target)
{
    // degenerate cases
    if (source == target) return 0;
    if (source.Length == 0) return target.Length;
    if (target.Length == 0) return source.Length;

    // create two work vectors of integer distances
    int[] v0 = new int[target.Length + 1];
    int[] v1 = new int[target.Length + 1];

    // initialize v0 (the previous row of distances)
    // this row is A[0][i]: edit distance for an empty s
    // the distance is just the number of characters to delete from t
    for (int i = 0; i < v0.Length; i++)
        v0[i] = i;

    for (int i = 0; i < source.Length; i++)
    {
        // calculate v1 (current row distances) from the previous row v0

        // first element of v1 is A[i+1][0]
        //   edit distance is delete (i+1) chars from s to match empty t
        v1[0] = i + 1;

        // use formula to fill in the rest of the row
        for (int j = 0; j < target.Length; j++)
        {
            var cost = (source[i] == target[j]) ? 0 : 1;
            v1[j + 1] = Math.Min(v1[j] + 1, Math.Min(v0[j + 1] + 1, v0[j] + cost));
        }

        // copy v1 (current row) to v0 (previous row) for next iteration
        for (int j = 0; j < v0.Length; j++)
            v0[j] = v1[j];
    }

    return v1[target.Length];
}

这是一个可以给你百分比相似度的函数。

/// <summary>
/// Calculate percentage similarity of two strings
/// <param name="source">Source String to Compare with</param>
/// <param name="target">Targeted String to Compare</param>
/// <returns>Return Similarity between two strings from 0 to 1.0</returns>
/// </summary>
public static double CalculateSimilarity(string source, string target)
{
    if ((source == null) || (target == null)) return 0.0;
    if ((source.Length == 0) || (target.Length == 0)) return 0.0;
    if (source == target) return 1.0;

    int stepsToSame = LevenshteinDistance(source, target);
    return (1.0 - ((double)stepsToSame / (double)Math.Max(source.Length, target.Length)));
}

Here is an implementation of the LevenshteinDistance method that uses far less memory while producing the same results. This is a C# adaptation of the pseudo code found in this wikipedia article under the "Iterative with two matrix rows" heading.

public static int LevenshteinDistance(string source, string target)
{
    // degenerate cases
    if (source == target) return 0;
    if (source.Length == 0) return target.Length;
    if (target.Length == 0) return source.Length;

    // create two work vectors of integer distances
    int[] v0 = new int[target.Length + 1];
    int[] v1 = new int[target.Length + 1];

    // initialize v0 (the previous row of distances)
    // this row is A[0][i]: edit distance for an empty s
    // the distance is just the number of characters to delete from t
    for (int i = 0; i < v0.Length; i++)
        v0[i] = i;

    for (int i = 0; i < source.Length; i++)
    {
        // calculate v1 (current row distances) from the previous row v0

        // first element of v1 is A[i+1][0]
        //   edit distance is delete (i+1) chars from s to match empty t
        v1[0] = i + 1;

        // use formula to fill in the rest of the row
        for (int j = 0; j < target.Length; j++)
        {
            var cost = (source[i] == target[j]) ? 0 : 1;
            v1[j + 1] = Math.Min(v1[j] + 1, Math.Min(v0[j + 1] + 1, v0[j] + cost));
        }

        // copy v1 (current row) to v0 (previous row) for next iteration
        for (int j = 0; j < v0.Length; j++)
            v0[j] = v1[j];
    }

    return v1[target.Length];
}

Here is a function that will give you the percentage similarity.

/// <summary>
/// Calculate percentage similarity of two strings
/// <param name="source">Source String to Compare with</param>
/// <param name="target">Targeted String to Compare</param>
/// <returns>Return Similarity between two strings from 0 to 1.0</returns>
/// </summary>
public static double CalculateSimilarity(string source, string target)
{
    if ((source == null) || (target == null)) return 0.0;
    if ((source.Length == 0) || (target.Length == 0)) return 0.0;
    if (source == target) return 1.0;

    int stepsToSame = LevenshteinDistance(source, target);
    return (1.0 - ((double)stepsToSame / (double)Math.Max(source.Length, target.Length)));
}
℉絮湮 2024-08-30 02:55:32

您的另一个选择是使用 Soundex 或 Metaphone 进行语音比较。我刚刚完成了一篇文章,介绍了这两种算法的 C# 代码。您可以在 http://www.blackbeltcoder.com/ 查看它文章/算法/phonetic-string-comparison-with-soundex

Your other option is to compare phonetically using Soundex or Metaphone. I just completed an article that presents C# code for both algorithms. You can view it at http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex.

蓝眸 2024-08-30 02:55:32

以下是计算字符串之间的Levenshtein 距离的两种方法。

两个字符串之间的编辑距离定义为将一个字符串转换为另一个字符串所需的最小编辑次数,允许的编辑操作是插入、删除或替换单个字符。

获得结果后,您需要定义要使用什么值作为匹配或不匹配的阈值。在一堆示例数据上运行该函数,以了解它如何工作,以帮助确定您的特定阈值。

    /// <summary>
    /// Calculates the Levenshtein distance between two strings--the number of changes that need to be made for the first string to become the second.
    /// </summary>
    /// <param name="first">The first string, used as a source.</param>
    /// <param name="second">The second string, used as a target.</param>
    /// <returns>The number of changes that need to be made to convert the first string to the second.</returns>
    /// <remarks>
    /// From http://www.merriampark.com/ldcsharp.htm
    /// </remarks>
    public static int LevenshteinDistance(string first, string second)
    {
        if (first == null)
        {
            throw new ArgumentNullException("first");
        }
        if (second == null)
        {
            throw new ArgumentNullException("second");
        }

        int n = first.Length;
        int m = second.Length;
        var d = new int[n + 1, m + 1]; // matrix

        if (n == 0) return m;
        if (m == 0) return n;

        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        for (int i = 1; i <= n; i++)
        {

            for (int j = 1; j <= m; j++)
            {
                int cost = (second.Substring(j - 1, 1) == first.Substring(i - 1, 1) ? 0 : 1); // cost
                d[i, j] = Math.Min(
                    Math.Min(
                        d[i - 1, j] + 1,
                        d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }

        return d[n, m];
    }

Here are two methods that calculate the Levenshtein Distance between strings.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

Once you have the result, you'll need to define what value you want to use as a threshold for a match or not. Run the function on a bunch of sample data to get a good idea of how it works to help decide on your particular threshold.

    /// <summary>
    /// Calculates the Levenshtein distance between two strings--the number of changes that need to be made for the first string to become the second.
    /// </summary>
    /// <param name="first">The first string, used as a source.</param>
    /// <param name="second">The second string, used as a target.</param>
    /// <returns>The number of changes that need to be made to convert the first string to the second.</returns>
    /// <remarks>
    /// From http://www.merriampark.com/ldcsharp.htm
    /// </remarks>
    public static int LevenshteinDistance(string first, string second)
    {
        if (first == null)
        {
            throw new ArgumentNullException("first");
        }
        if (second == null)
        {
            throw new ArgumentNullException("second");
        }

        int n = first.Length;
        int m = second.Length;
        var d = new int[n + 1, m + 1]; // matrix

        if (n == 0) return m;
        if (m == 0) return n;

        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        for (int i = 1; i <= n; i++)
        {

            for (int j = 1; j <= m; j++)
            {
                int cost = (second.Substring(j - 1, 1) == first.Substring(i - 1, 1) ? 0 : 1); // cost
                d[i, j] = Math.Min(
                    Math.Min(
                        d[i - 1, j] + 1,
                        d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }

        return d[n, m];
    }
鹿童谣 2024-08-30 02:55:32
using System;
 
public class Example
{
    public static int getEditDistance(string X, string Y)
    {
        int m = X.Length;
        int n = Y.Length;
 
        int[][] T = new int[m + 1][];
        for (int i = 0; i < m + 1; ++i) {
            T[i] = new int[n + 1];
        }
 
        for (int i = 1; i <= m; i++) {
            T[i][0] = i;
        }
        for (int j = 1; j <= n; j++) {
            T[0][j] = j;
        }
 
        int cost;
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                cost = X[i - 1] == Y[j - 1] ? 0: 1;
                T[i][j] = Math.Min(Math.Min(T[i - 1][j] + 1, T[i][j - 1] + 1),
                        T[i - 1][j - 1] + cost);
            }
        }
 
        return T[m][n];
    }
 
    public static double findSimilarity(string x, string y) {
        if (x == null || y == null) {
            throw new ArgumentException("Strings must not be null");
        }
 
        double maxLength = Math.Max(x.Length, y.Length);
        if (maxLength > 0) {
            // optionally ignore case if needed
            return (maxLength - getEditDistance(x, y)) / maxLength;
        }
        return 1.0;
    }
 
 
    public static void Main()
    {
        string s1 = "Techie Delight";
        string s2 = "Tech Delight";
 
        double similarity = findSimilarity(s1, s2) * 100;
 
        Console.WriteLine(similarity);        // 85.71428571428571
    }
}
using System;
 
public class Example
{
    public static int getEditDistance(string X, string Y)
    {
        int m = X.Length;
        int n = Y.Length;
 
        int[][] T = new int[m + 1][];
        for (int i = 0; i < m + 1; ++i) {
            T[i] = new int[n + 1];
        }
 
        for (int i = 1; i <= m; i++) {
            T[i][0] = i;
        }
        for (int j = 1; j <= n; j++) {
            T[0][j] = j;
        }
 
        int cost;
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                cost = X[i - 1] == Y[j - 1] ? 0: 1;
                T[i][j] = Math.Min(Math.Min(T[i - 1][j] + 1, T[i][j - 1] + 1),
                        T[i - 1][j - 1] + cost);
            }
        }
 
        return T[m][n];
    }
 
    public static double findSimilarity(string x, string y) {
        if (x == null || y == null) {
            throw new ArgumentException("Strings must not be null");
        }
 
        double maxLength = Math.Max(x.Length, y.Length);
        if (maxLength > 0) {
            // optionally ignore case if needed
            return (maxLength - getEditDistance(x, y)) / maxLength;
        }
        return 1.0;
    }
 
 
    public static void Main()
    {
        string s1 = "Techie Delight";
        string s2 = "Tech Delight";
 
        double similarity = findSimilarity(s1, s2) * 100;
 
        Console.WriteLine(similarity);        // 85.71428571428571
    }
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文