当前位置：文江博客话题详情

C# levenshtein-distance threshold

Damerau - Levenshtein 距离，添加阈值

发布于 2024-09-25 21:59:06 字数 1940 浏览 18 评论 0 原文

我有以下实现，但我想添加一个阈值，因此如果结果大于它，则停止计算并返回。

我该怎么办呢？

编辑：这是我当前的代码，尚未使用 threshold ...目标是使用它

    public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
    {
        // Return trivial case - where they are equal
        if (string1.Equals(string2))
            return 0;

        // Return trivial case - where one is empty
        if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
            return (string1 ?? "").Length + (string2 ?? "").Length;


        // Ensure string2 (inner cycle) is longer
        if (string1.Length > string2.Length)
        {
            var tmp = string1;
            string1 = string2;
            string2 = tmp;
        }

        // Return trivial case - where string1 is contained within string2
        if (string2.Contains(string1))
            return string2.Length - string1.Length;

        var length1 = string1.Length;
        var length2 = string2.Length;

        var d = new int[length1 + 1, length2 + 1];

        for (var i = 0; i <= d.GetUpperBound(0); i++)
            d[i, 0] = i;

        for (var i = 0; i <= d.GetUpperBound(1); i++)
            d[0, i] = i;

        for (var i = 1; i <= d.GetUpperBound(0); i++)
        {
            for (var j = 1; j <= d.GetUpperBound(1); j++)
            {
                var cost = string1[i - 1] == string2[j - 1] ? 0 : 1;

                var del = d[i - 1, j] + 1;
                var ins = d[i, j - 1] + 1;
                var sub = d[i - 1, j - 1] + cost;

                d[i, j] = Math.Min(del, Math.Min(ins, sub));

                if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
                    d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
            }
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)];
    }
}

原文

I have the following implementation, but I want to add a threshold, so if the result is going to be greater than it, just stop calculating and return.

How would I go about that?

EDIT: Here is my current code, threshold is not yet used...the goal is that it is used

    public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
    {
        // Return trivial case - where they are equal
        if (string1.Equals(string2))
            return 0;

        // Return trivial case - where one is empty
        if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
            return (string1 ?? "").Length + (string2 ?? "").Length;


        // Ensure string2 (inner cycle) is longer
        if (string1.Length > string2.Length)
        {
            var tmp = string1;
            string1 = string2;
            string2 = tmp;
        }

        // Return trivial case - where string1 is contained within string2
        if (string2.Contains(string1))
            return string2.Length - string1.Length;

        var length1 = string1.Length;
        var length2 = string2.Length;

        var d = new int[length1 + 1, length2 + 1];

        for (var i = 0; i <= d.GetUpperBound(0); i++)
            d[i, 0] = i;

        for (var i = 0; i <= d.GetUpperBound(1); i++)
            d[0, i] = i;

        for (var i = 1; i <= d.GetUpperBound(0); i++)
        {
            for (var j = 1; j <= d.GetUpperBound(1); j++)
            {
                var cost = string1[i - 1] == string2[j - 1] ? 0 : 1;

                var del = d[i - 1, j] + 1;
                var ins = d[i, j - 1] + 1;
                var sub = d[i - 1, j - 1] + cost;

                d[i, j] = Math.Min(del, Math.Min(ins, sub));

                if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
                    d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
            }
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)];
    }
}

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

扛刀软妹 2024-10-02 21:59:07

这是关于您的回答：Damerau - Levenshtein 距离，添加阈值< /a>
（抱歉，无法发表评论，因为我还没有 50 名代表）

我认为您在这里犯了一个错误。您初始化了：

var minDistance = threshold;

并且您的更新规则是：

if (d[i, j] < minDistance)
   minDistance = d[i, j];

另外，您的早期退出标准是：

if (minDistance > threshold)
   return int.MaxValue;

现在，观察上面的 if 条件永远不会成立！您应该将 minDistance 初始化为 int.MaxValue

This is Regarding ur answer this: Damerau - Levenshtein Distance, adding a threshold
(sorry can't comment as I don't have 50 rep yet)

I think you have made an error here. You initialized:

var minDistance = threshold;

And your update rule is:

if (d[i, j] < minDistance)
   minDistance = d[i, j];

Also, your early exit criteria is:

if (minDistance > threshold)
   return int.MaxValue;

Now, observe that the if condition above will never hold true! You should rather initialize minDistance to int.MaxValue

回复收藏 0 原文

清晰传感 2024-10-02 21:59:07

这是我能想到的最优雅的方式。设置完d的每个索引后，看看它是否超出了你的阈值。评估是恒定时间的，因此与整个算法的理论 N^2 复杂度相比，它只是九牛一毛：

public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
{
    ...

    for (var i = 1; i <= d.GetUpperBound(0); i++)
    {
        for (var j = 1; j <= d.GetUpperBound(1); j++)
        {
            ...

            var temp = d[i,j] = Math.Min(del, Math.Min(ins, sub));

            if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
                temp = d[i,j] = Math.Min(temp, d[i - 2, j - 2] + cost);

            //Does this value exceed your threshold? if so, get out now
            if(temp > threshold) 
              return temp;
        }
    }

    return d[d.GetUpperBound(0), d.GetUpperBound(1)];
}

Here's the most elegant way I can think of. After setting each index of d, see if it exceeds your threshold. The evaluation is constant-time, so it's a drop in the bucket compared to the theoretical N^2 complexity of the overall algorithm:

public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
{
    ...

    for (var i = 1; i <= d.GetUpperBound(0); i++)
    {
        for (var j = 1; j <= d.GetUpperBound(1); j++)
        {
            ...

            var temp = d[i,j] = Math.Min(del, Math.Min(ins, sub));

            if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
                temp = d[i,j] = Math.Min(temp, d[i - 2, j - 2] + cost);

            //Does this value exceed your threshold? if so, get out now
            if(temp > threshold) 
              return temp;
        }
    }

    return d[d.GetUpperBound(0), d.GetUpperBound(1)];
}

回复收藏 0 原文

酒中人 2024-10-02 21:59:07

您还将此作为 SQL CLR UDF 问题提出，因此我将在特定上下文中回答：您最好的优化不是来自优化 Levenshtein 距离，而是来自减少比较的对数量。是的，更快的 Levenshtein 算法会改善情况，但不如将比较次数从 N 平方（其中 N 数百万行）减少到 N*某个因子那么多。我的建议是仅比较长度差异在可容忍增量内的元素。在大表上，您可以在 LEN(Data) 上添加一个持久计算列，然后使用包含数据在其上创建索引：

ALTER TABLE Table ADD LenData AS LEN(Data) PERSISTED;
CREATE INDEX ndxTableLenData on Table(LenData) INCLUDE (Data);

现在您可以通过加入长度上的最大差异来限制纯粹的问题空间（例如，假设 5），如果您的数据的 LEN(Data) 变化很大：

SELECT a.Data, b.Data, dbo.Levenshtein(a.Data, b.Data)
FROM Table A
JOIN Table B ON B.DataLen BETWEEN A.DataLen - 5 AND A.DataLen+5

You also asked this as a SQL CLR UDF question so I'll answer in that specific context: you best optmiziation won't come from optimizing the Levenshtein distance, but from reducing the number of pairs you compare. Yes, a faster Levenshtein algorithm will improve things, but not nearly as much as reducing the number of comparisons from N square (with N in the millions of rows) to N*some factor. My proposal is to compare only elements who have the length difference within a tolerable delta. On your big table, you add a persisted computed column on LEN(Data) and then create an index on it with include Data:

ALTER TABLE Table ADD LenData AS LEN(Data) PERSISTED;
CREATE INDEX ndxTableLenData on Table(LenData) INCLUDE (Data);

Now you can restrict the sheer problem space by joining within an max difference on lenght (eg. say 5), if your data's LEN(Data) varies significantly:

SELECT a.Data, b.Data, dbo.Levenshtein(a.Data, b.Data)
FROM Table A
JOIN Table B ON B.DataLen BETWEEN A.DataLen - 5 AND A.DataLen+5

回复收藏 0 原文

記憶穿過時間隧道 2024-10-02 21:59:07

终于收到了……虽然没有我想象的那么好用

    public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
    {
        // Return trivial case - where they are equal
        if (string1.Equals(string2))
            return 0;

        // Return trivial case - where one is empty
        if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
            return (string1 ?? "").Length + (string2 ?? "").Length;


        // Ensure string2 (inner cycle) is longer
        if (string1.Length > string2.Length)
        {
            var tmp = string1;
            string1 = string2;
            string2 = tmp;
        }

        // Return trivial case - where string1 is contained within string2
        if (string2.Contains(string1))
            return string2.Length - string1.Length;

        var length1 = string1.Length;
        var length2 = string2.Length;

        var d = new int[length1 + 1, length2 + 1];

        for (var i = 0; i <= d.GetUpperBound(0); i++)
            d[i, 0] = i;

        for (var i = 0; i <= d.GetUpperBound(1); i++)
            d[0, i] = i;

        for (var i = 1; i <= d.GetUpperBound(0); i++)
        {
            var im1 = i - 1;
            var im2 = i - 2;
            var minDistance = threshold;

            for (var j = 1; j <= d.GetUpperBound(1); j++)
            {
                var jm1 = j - 1;
                var jm2 = j - 2;
                var cost = string1[im1] == string2[jm1] ? 0 : 1;

                var del = d[im1, j] + 1;
                var ins = d[i, jm1] + 1;
                var sub = d[im1, jm1] + cost;

                //Math.Min is slower than native code
                //d[i, j] = Math.Min(del, Math.Min(ins, sub));
                d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;

                if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
                    d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost);

                if (d[i, j] < minDistance)
                    minDistance = d[i, j];
            }

            if (minDistance > threshold)
                return int.MaxValue;
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold 
            ? int.MaxValue 
            : d[d.GetUpperBound(0), d.GetUpperBound(1)];
    }

Finally got it...though it's not as beneficial as I had hoped

    public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
    {
        // Return trivial case - where they are equal
        if (string1.Equals(string2))
            return 0;

        // Return trivial case - where one is empty
        if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
            return (string1 ?? "").Length + (string2 ?? "").Length;


        // Ensure string2 (inner cycle) is longer
        if (string1.Length > string2.Length)
        {
            var tmp = string1;
            string1 = string2;
            string2 = tmp;
        }

        // Return trivial case - where string1 is contained within string2
        if (string2.Contains(string1))
            return string2.Length - string1.Length;

        var length1 = string1.Length;
        var length2 = string2.Length;

        var d = new int[length1 + 1, length2 + 1];

        for (var i = 0; i <= d.GetUpperBound(0); i++)
            d[i, 0] = i;

        for (var i = 0; i <= d.GetUpperBound(1); i++)
            d[0, i] = i;

        for (var i = 1; i <= d.GetUpperBound(0); i++)
        {
            var im1 = i - 1;
            var im2 = i - 2;
            var minDistance = threshold;

            for (var j = 1; j <= d.GetUpperBound(1); j++)
            {
                var jm1 = j - 1;
                var jm2 = j - 2;
                var cost = string1[im1] == string2[jm1] ? 0 : 1;

                var del = d[im1, j] + 1;
                var ins = d[i, jm1] + 1;
                var sub = d[im1, jm1] + cost;

                //Math.Min is slower than native code
                //d[i, j] = Math.Min(del, Math.Min(ins, sub));
                d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;

                if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
                    d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost);

                if (d[i, j] < minDistance)
                    minDistance = d[i, j];
            }

            if (minDistance > threshold)
                return int.MaxValue;
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold 
            ? int.MaxValue 
            : d[d.GetUpperBound(0), d.GetUpperBound(1)];
    }

回复收藏 0 原文

~没有更多了~