使用 Levenshtein 距离检测字符串相似度(速度慢)
该查询返回 2100 万条记录,
我循环遍历该表的方式需要很长时间。还有哪些其他解决方案?
SqlDataReader rd = DbInfo.DataRdr(Conn,
"SELECT a.NAME AS ANAME, b.NAME AS BNAME, a.ID as AID, b.ID AS BUD " +
"FROM myTable a JOIN myTable b ON a.NUM = b.NUM AND a.ID <> b.ID");
while (rd.Read())
{
if (rd["ANAME"].ToString().LevenshteinDistance(rd["BNAME"].ToString()) <= 10)
{
Logging.Write(...);
}
}
public static int LevenshteinDistance(this string s, string t)
{
if (s == null)
throw new ArgumentNullException("s");
if (t == null)
throw new ArgumentNullException("t");
int n = s.Length;
int m = t.Length;
int[,] d = new int[n+1,m+1];
if (n == 0 || m == 0)
return Math.Max(m, n);
for (int i = 0; i <= n; i++)
{
d[i, 0] = i;
}
for (int i = 0; i < m; i++)
{
d[0, i] = i;
}
for (int i = 0; i < n; i++)
{
for (int j = 0; j < m; j++)
{
int cost = (t[j] == s[i]) ? 0 : 1;
d[i + 1, j + 1] = Math.Min(Math.Min(d[i, j + 1] + 1, d[i + 1, j] + 1), d[i, j] + cost);
}
}
return d[n, m];
}
The query returns 21 million records
They way I loop through this table takes forever. What other solutions are there?
SqlDataReader rd = DbInfo.DataRdr(Conn,
"SELECT a.NAME AS ANAME, b.NAME AS BNAME, a.ID as AID, b.ID AS BUD " +
"FROM myTable a JOIN myTable b ON a.NUM = b.NUM AND a.ID <> b.ID");
while (rd.Read())
{
if (rd["ANAME"].ToString().LevenshteinDistance(rd["BNAME"].ToString()) <= 10)
{
Logging.Write(...);
}
}
public static int LevenshteinDistance(this string s, string t)
{
if (s == null)
throw new ArgumentNullException("s");
if (t == null)
throw new ArgumentNullException("t");
int n = s.Length;
int m = t.Length;
int[,] d = new int[n+1,m+1];
if (n == 0 || m == 0)
return Math.Max(m, n);
for (int i = 0; i <= n; i++)
{
d[i, 0] = i;
}
for (int i = 0; i < m; i++)
{
d[0, i] = i;
}
for (int i = 0; i < n; i++)
{
for (int j = 0; j < m; j++)
{
int cost = (t[j] == s[i]) ? 0 : 1;
d[i + 1, j + 1] = Math.Min(Math.Min(d[i, j + 1] + 1, d[i + 1, j] + 1), d[i, j] + cost);
}
}
return d[n, m];
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
您可以首先使用以下内容作为查询,具体取决于
NUM
列实际上相等的频率:然后您的 SQL 查询将为您提供具有匹配
NUM
的行对您可以调用LevenshteinDistance
。像这样的东西:You could start by using the following as your query instead, depending on how often the
NUM
columns are actually equal:Then your SQL query will give you pairs of rows with matching
NUM
s that you can callLevenshteinDistance
on. Something like:您甚至将
dt1.Rows[i]["Name"].ToString()
与dt1.Rows[j]["Name"].ToString()
进行比较当 i == j 时。尝试循环
0 <= i < dt1.Rows.Count - 1 和 i + 1 <= j < dt1.Rows.Count
。此外,仅当
dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString()
时才记录,即可能是更快的检查。如果这是错误的,那么编辑编辑就没有意义了。编辑:@John 关于
dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString()
是正确的(都 <代码>我?)。You're comparing
dt1.Rows[i]["Name"].ToString()
withdt1.Rows[j]["Name"].ToString()
even wheni == j
.Try looping
0 <= i < dt1.Rows.Count - 1
andi + 1 <= j < dt1.Rows.Count
.Also, you're logging only if
dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString()
, which is probably a faster check. There's no point in doing Levenshtein if that's false.EDIT: @John is right about the
dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString()
(bothi
?).优化:
1) 查看您的数据。也许你可以做一些检查来更快地找出无效对。如果
Name
的长度变化超过 10,您可以检查s.Lenght
和t.Length
之间的差异是否大于 10 并返回马上就很大的距离(可能是 int.MaxValue 或只是 100)。如果距离明显超出范围,那么计算距离就没有意义了。2)寻找小的优化。超过 15 万行循环两次意味着 225 亿次迭代。微小的改变可能会产生巨大的影响。您可以尝试缓存行对象并使用
Equals()
方法删除ToString()
。我认为这比访问数据表的第 i 个元素 150000 次要快。3)尝试减少/分割数据集。执行查询以获取
NUM
的所有可能值select different NUM from myTable
。然后,对于结果中的每个NUM
执行原始查询,但使用 where 条件并仅选择名称:SELECT name from myTable where NUM = currentNum
。这样您就不必比较 NUM 行,也不必选择奇数数据。您的代码可以优化为仅执行编辑距离,但使用 1+2 中所述的优化。
4)尝试不同的方法,例如全文搜索。
我只想解决类似的问题,在 120 万行的表中查找匹配项。我使用了 lucene.net,它在搜索行的一个或多个属性时为我提供实时结果。
他们也进行编辑,但也许它比你的实现更快;) MSSQL Server 也支持全文搜索。
Optimizations:
1) Take a look at your data. Maybe you can do some checks to sort out invalid pairs faster. If the lenght of
Name
varies over more than 10 you can check if the difference betweens.Lenght
andt.Length
is greater 10 and return a high distance right away (maybe int.MaxValue or just 100). There is no point in calculating the distance if it`s clearly out of scope anyway.2) Look for small optimizations. Looping twice over 150k rows means 22.5 billion iterations. Small changes may have a great effect. You may try to cache to row objects and remove the
ToString()
by usign theEquals()
method. I think this would be faster than accessing the i-th element of your datatable 150000 times.3) Try to reduce/split the datasets. Execute a query to get all possible values of
NUM
select distinct NUM from myTable
. Then for eachNUM
in your result do your original query but using a where condition and select only the name:SELECT name from myTable where NUM = currentNum
.This way you don
t have to compare the NUM row and you don
t select odd data. Your code can be optimized to do just the levenshtein distance but using the optimizations stated in 1+2.4) Try a different approach like fulltext search.
I just hat to solve a similar problem, finding matches in a 1.2 million rows table. I used lucene.net which provides me realtime results when searching over one or more properties of my rows.
They do levenshtein too, but Maybe its faster than your implementation ;) MSSQL Server supports fulltext search, too.
您可以做出的最大改进是减少正在考虑的解决方案空间。由于您希望最大距离为 10,因此长度差异超过 10 的任何字符串都不可能符合条件:
接下来,分析您的代码并查看热点在哪里。一篇不错的入门文章:使用 Visual Studio Profiler 查找应用程序瓶颈。
并查看在 C# 中优化 Levenshtein 算法。
编辑
Chris还注意到,由于
levenshtein(a,b) == levenshtein(b,a)
,您只需要在连接上选择a.ID < b.ID,因为矩阵是对称的。这将使您的问题立即减少一半。The biggest improvement you can make is to reduce the solution space being considered. Since you want a max distance of 10, any strings that differ by length more than 10 cannot possibly qualify:
Next, profile your code and see where are the hot spots. A good entry article: Find Application Bottlenecks with Visual Studio Profiler.
And have a look at Optimizing the Levenshtein Algorithm in C#.
Edit
Also Chris noticed that since
levenshtein(a,b) == levenshtein(b,a)
you need only select on the join a.ID < b.ID, since the matrix is symmetrical. This will halve your problem right off the bat.完成本线程中提到的其他优化后,您可以将 Levenshtein 计算移至服务器,并仅选择与您的编辑距离条件匹配的行。我在项目中需要此功能,因此我用它创建了一个库,此处。 lib中使用的编辑距离方法只需要n * 2内存,而不是n * m。
例如,即使在服务器上,您也只想在字符串长度差 < 时进行 EditDistance 计算。 10,所以先检查一下。
像这样的东西
After doing the other optimizations mentioned in this thread, you can move the Levenshtein computation to the server and only SELECT the rows that matches your Edit distance criteria. I needed this functionality in a project, so I made a library out of it, here. The edit distance method used in the lib only requires n * 2 memory instead of n * m.
For instance, even when on the server, you only want to do the EditDistance computation when the difference in string length is < 10, so check for that first.
Something like