Damerau - Levenshtein 距离,添加阈值
我有以下实现,但我想添加一个阈值,因此如果结果大于它,则停止计算并返回。
我该怎么办呢?
编辑:这是我当前的代码,尚未使用 threshold
...目标是使用它
public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
{
// Return trivial case - where they are equal
if (string1.Equals(string2))
return 0;
// Return trivial case - where one is empty
if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
return (string1 ?? "").Length + (string2 ?? "").Length;
// Ensure string2 (inner cycle) is longer
if (string1.Length > string2.Length)
{
var tmp = string1;
string1 = string2;
string2 = tmp;
}
// Return trivial case - where string1 is contained within string2
if (string2.Contains(string1))
return string2.Length - string1.Length;
var length1 = string1.Length;
var length2 = string2.Length;
var d = new int[length1 + 1, length2 + 1];
for (var i = 0; i <= d.GetUpperBound(0); i++)
d[i, 0] = i;
for (var i = 0; i <= d.GetUpperBound(1); i++)
d[0, i] = i;
for (var i = 1; i <= d.GetUpperBound(0); i++)
{
for (var j = 1; j <= d.GetUpperBound(1); j++)
{
var cost = string1[i - 1] == string2[j - 1] ? 0 : 1;
var del = d[i - 1, j] + 1;
var ins = d[i, j - 1] + 1;
var sub = d[i - 1, j - 1] + cost;
d[i, j] = Math.Min(del, Math.Min(ins, sub));
if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
}
}
return d[d.GetUpperBound(0), d.GetUpperBound(1)];
}
}
I have the following implementation, but I want to add a threshold, so if the result is going to be greater than it, just stop calculating and return.
How would I go about that?
EDIT: Here is my current code, threshold
is not yet used...the goal is that it is used
public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
{
// Return trivial case - where they are equal
if (string1.Equals(string2))
return 0;
// Return trivial case - where one is empty
if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
return (string1 ?? "").Length + (string2 ?? "").Length;
// Ensure string2 (inner cycle) is longer
if (string1.Length > string2.Length)
{
var tmp = string1;
string1 = string2;
string2 = tmp;
}
// Return trivial case - where string1 is contained within string2
if (string2.Contains(string1))
return string2.Length - string1.Length;
var length1 = string1.Length;
var length2 = string2.Length;
var d = new int[length1 + 1, length2 + 1];
for (var i = 0; i <= d.GetUpperBound(0); i++)
d[i, 0] = i;
for (var i = 0; i <= d.GetUpperBound(1); i++)
d[0, i] = i;
for (var i = 1; i <= d.GetUpperBound(0); i++)
{
for (var j = 1; j <= d.GetUpperBound(1); j++)
{
var cost = string1[i - 1] == string2[j - 1] ? 0 : 1;
var del = d[i - 1, j] + 1;
var ins = d[i, j - 1] + 1;
var sub = d[i - 1, j - 1] + cost;
d[i, j] = Math.Min(del, Math.Min(ins, sub));
if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
}
}
return d[d.GetUpperBound(0), d.GetUpperBound(1)];
}
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
这是关于您的回答:Damerau - Levenshtein 距离,添加阈值< /a>
(抱歉,无法发表评论,因为我还没有 50 名代表)
我认为您在这里犯了一个错误。您初始化了:
并且您的更新规则是:
另外,您的早期退出标准是:
现在,观察上面的 if 条件永远不会成立!您应该将
minDistance
初始化为int.MaxValue
This is Regarding ur answer this: Damerau - Levenshtein Distance, adding a threshold
(sorry can't comment as I don't have 50 rep yet)
I think you have made an error here. You initialized:
And your update rule is:
Also, your early exit criteria is:
Now, observe that the if condition above will never hold true! You should rather initialize
minDistance
toint.MaxValue
这是我能想到的最优雅的方式。设置完d的每个索引后,看看它是否超出了你的阈值。评估是恒定时间的,因此与整个算法的理论 N^2 复杂度相比,它只是九牛一毛:
Here's the most elegant way I can think of. After setting each index of d, see if it exceeds your threshold. The evaluation is constant-time, so it's a drop in the bucket compared to the theoretical N^2 complexity of the overall algorithm:
您还将此作为 SQL CLR UDF 问题提出,因此我将在特定上下文中回答:您最好的优化不是来自优化 Levenshtein 距离,而是来自减少比较的对数量。是的,更快的 Levenshtein 算法会改善情况,但不如将比较次数从 N 平方(其中 N 数百万行)减少到 N*某个因子那么多。我的建议是仅比较长度差异在可容忍增量内的元素。在大表上,您可以在
LEN(Data)
上添加一个持久计算列,然后使用包含数据在其上创建索引:现在您可以通过加入长度上的最大差异来限制纯粹的问题空间(例如,假设 5),如果您的数据的
LEN(Data)
变化很大:You also asked this as a SQL CLR UDF question so I'll answer in that specific context: you best optmiziation won't come from optimizing the Levenshtein distance, but from reducing the number of pairs you compare. Yes, a faster Levenshtein algorithm will improve things, but not nearly as much as reducing the number of comparisons from N square (with N in the millions of rows) to N*some factor. My proposal is to compare only elements who have the length difference within a tolerable delta. On your big table, you add a persisted computed column on
LEN(Data)
and then create an index on it with include Data:Now you can restrict the sheer problem space by joining within an max difference on lenght (eg. say 5), if your data's
LEN(Data)
varies significantly:终于收到了……虽然没有我想象的那么好用
Finally got it...though it's not as beneficial as I had hoped