使用 Levenshtein 距离检测字符串相似度(速度慢)

发布于 2024-11-19 13:11:44 字数 1285 浏览 4 评论 0原文

该查询返回 2100 万条记录,

我循环遍历该表的方式需要很长时间。还有哪些其他解决方案?

SqlDataReader rd = DbInfo.DataRdr(Conn,
 "SELECT a.NAME AS ANAME, b.NAME AS BNAME, a.ID as AID, b.ID AS BUD " +
 "FROM myTable a JOIN myTable b ON a.NUM = b.NUM AND a.ID <> b.ID");

while (rd.Read())
{
   if (rd["ANAME"].ToString().LevenshteinDistance(rd["BNAME"].ToString()) <= 10)
   {

        Logging.Write(...);

   }
}



    public static int LevenshteinDistance(this string s, string t)
    {
        if (s == null)
            throw new ArgumentNullException("s");
        if (t == null)
            throw new ArgumentNullException("t");
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n+1,m+1];

        if (n == 0 || m == 0)
            return Math.Max(m, n);

        for (int i = 0; i <= n; i++)
        {
            d[i, 0] = i;
        }
        for (int i = 0; i < m; i++)
        {
            d[0, i] = i;
        }

        for (int i = 0; i < n; i++)
        {
            for (int j = 0; j < m; j++)
            {
                int cost = (t[j] == s[i]) ? 0 : 1;

                d[i + 1, j + 1] = Math.Min(Math.Min(d[i, j + 1] + 1, d[i + 1, j] + 1), d[i, j] + cost);
            }
        }

        return d[n, m];
    }

The query returns 21 million records

They way I loop through this table takes forever. What other solutions are there?

SqlDataReader rd = DbInfo.DataRdr(Conn,
 "SELECT a.NAME AS ANAME, b.NAME AS BNAME, a.ID as AID, b.ID AS BUD " +
 "FROM myTable a JOIN myTable b ON a.NUM = b.NUM AND a.ID <> b.ID");

while (rd.Read())
{
   if (rd["ANAME"].ToString().LevenshteinDistance(rd["BNAME"].ToString()) <= 10)
   {

        Logging.Write(...);

   }
}



    public static int LevenshteinDistance(this string s, string t)
    {
        if (s == null)
            throw new ArgumentNullException("s");
        if (t == null)
            throw new ArgumentNullException("t");
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n+1,m+1];

        if (n == 0 || m == 0)
            return Math.Max(m, n);

        for (int i = 0; i <= n; i++)
        {
            d[i, 0] = i;
        }
        for (int i = 0; i < m; i++)
        {
            d[0, i] = i;
        }

        for (int i = 0; i < n; i++)
        {
            for (int j = 0; j < m; j++)
            {
                int cost = (t[j] == s[i]) ? 0 : 1;

                d[i + 1, j + 1] = Math.Min(Math.Min(d[i, j + 1] + 1, d[i + 1, j] + 1), d[i, j] + cost);
            }
        }

        return d[n, m];
    }

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

伪心 2024-11-26 13:11:44

您可以首先使用以下内容作为查询,具体取决于 NUM 列实际上相等的频率:

SELECT a.NAME AS ANAME, b.NAME AS BNAME, other things
FROM myTable a
JOIN myTable b
ON a.NUM = b.NUM
AND a.id < b.id

然后您的 SQL 查询将为您提供具有匹配 NUM 的行对您可以调用 LevenshteinDistance 。像这样的东西:

DataTable dt1 = new DataTable();
dt1.Load(DbInfo.DataRdr(Conn, "[the query I wrote above]"));

for (int i = 0; i < dt1.Rows.Count; i++)
{
   if (dt1.Rows[i]["ANAME"].ToString().LevenshteinDistance(dt1.Rows[i]["BNAME"].ToString()) <= 10)
   {
     Logging.Write(...[modify the query so it returns the things you want to log]...);
   }
}

You could start by using the following as your query instead, depending on how often the NUM columns are actually equal:

SELECT a.NAME AS ANAME, b.NAME AS BNAME, other things
FROM myTable a
JOIN myTable b
ON a.NUM = b.NUM
AND a.id < b.id

Then your SQL query will give you pairs of rows with matching NUMs that you can call LevenshteinDistance on. Something like:

DataTable dt1 = new DataTable();
dt1.Load(DbInfo.DataRdr(Conn, "[the query I wrote above]"));

for (int i = 0; i < dt1.Rows.Count; i++)
{
   if (dt1.Rows[i]["ANAME"].ToString().LevenshteinDistance(dt1.Rows[i]["BNAME"].ToString()) <= 10)
   {
     Logging.Write(...[modify the query so it returns the things you want to log]...);
   }
}
ζ澈沫 2024-11-26 13:11:44

您甚至将 dt1.Rows[i]["Name"].ToString()dt1.Rows[j]["Name"].ToString() 进行比较当 i == j 时。

尝试循环 0 <= i < dt1.Rows.Count - 1 和 i + 1 <= j < dt1.Rows.Count

此外,仅当 dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString() 时才记录,即可能是更快的检查。如果这是错误的,那么编辑编辑就没有意义了。

编辑:@John 关于 dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString() 是正确的(都 <代码>我?)。

You're comparing dt1.Rows[i]["Name"].ToString() with dt1.Rows[j]["Name"].ToString() even when i == j.

Try looping 0 <= i < dt1.Rows.Count - 1 and i + 1 <= j < dt1.Rows.Count.

Also, you're logging only if dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString(), which is probably a faster check. There's no point in doing Levenshtein if that's false.

EDIT: @John is right about the dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString() (both i?).

坚持沉默 2024-11-26 13:11:44

优化

1) 查看您的数据。也许你可以做一些检查来更快地找出无效对。如果 Name 的长度变化超过 10,您可以检查 s.Lenghtt.Length 之间的差异是否大于 10 并返回马上就很大的距离(可能是 int.MaxValue 或只是 100)。如果距离明显超出范围,那么计算距离就没有意义了。

2)寻找小的优化。超过 15 万行循环两次意味着 225 亿次迭代。微小的改变可能会产生巨大的影响。您可以尝试缓存行对象并使用 Equals() 方法删除 ToString() 。我认为这比访问数据表的第 i 个元素 150000 次要快。

for (int i = 0; i < dt1.Rows.Count; i++)
{
   var outerRow = dt1.Rows[i];
   for (int j = 0; i + 1 < dt1.Rows.Count; j++)
   {
     var innerRow = dt1.Rows[j];
     if (Equals(outerRow["NUM"] == innerRow["NUM"]))
     {
        if (outerRow["Name"].ToString().LevenshteinDistance(innerRow.ToString()) <= 10)
        {
           Logging.Write(...);
        }
     }
  }

3)尝试减少/分割数据集。执行查询以获取 NUM 的所有可能值 select different NUM from myTable。然后,对于结果中的每个 NUM 执行原始查询,但使用 where 条件并仅选择名称:SELECT name from myTable where NUM = currentNum

这样您就不必比较 NUM 行,也不必选择奇数数据。您的代码可以优化为仅执行编辑距离,但使用 1+2 中所述的优化。

4)尝试不同的方法,例如全文搜索。

我只想解决类似的问题,在 120 万行的表中查找匹配项。我使用了 lucene.net,它在搜索行的一个或多个属性时为我提供实时结果。

他们也进行编辑,但也许它比你的实现更快;) MSSQL Server 也支持全文搜索。

Optimizations:

1) Take a look at your data. Maybe you can do some checks to sort out invalid pairs faster. If the lenght of Name varies over more than 10 you can check if the difference between s.Lenght and t.Length is greater 10 and return a high distance right away (maybe int.MaxValue or just 100). There is no point in calculating the distance if it`s clearly out of scope anyway.

2) Look for small optimizations. Looping twice over 150k rows means 22.5 billion iterations. Small changes may have a great effect. You may try to cache to row objects and remove the ToString() by usign the Equals() method. I think this would be faster than accessing the i-th element of your datatable 150000 times.

for (int i = 0; i < dt1.Rows.Count; i++)
{
   var outerRow = dt1.Rows[i];
   for (int j = 0; i + 1 < dt1.Rows.Count; j++)
   {
     var innerRow = dt1.Rows[j];
     if (Equals(outerRow["NUM"] == innerRow["NUM"]))
     {
        if (outerRow["Name"].ToString().LevenshteinDistance(innerRow.ToString()) <= 10)
        {
           Logging.Write(...);
        }
     }
  }

3) Try to reduce/split the datasets. Execute a query to get all possible values of NUM select distinct NUM from myTable. Then for each NUM in your result do your original query but using a where condition and select only the name: SELECT name from myTable where NUM = currentNum.

This way you dont have to compare the NUM row and you dont select odd data. Your code can be optimized to do just the levenshtein distance but using the optimizations stated in 1+2.

4) Try a different approach like fulltext search.

I just hat to solve a similar problem, finding matches in a 1.2 million rows table. I used lucene.net which provides me realtime results when searching over one or more properties of my rows.

They do levenshtein too, but Maybe its faster than your implementation ;) MSSQL Server supports fulltext search, too.

你是年少的欢喜 2024-11-26 13:11:44

您可以做出的最大改进是减少正在考虑的解决方案空间。由于您希望最大距离为 10,因此长度差异超过 10 的任何字符串都不可能符合条件:

SELECT a.NAME AS ANAME, b.NAME AS BNAME, a.ID as AID, b.ID AS BUD 
 FROM myTable a JOIN myTable b ON a.NUM = b.NUM AND a.ID < b.ID
 WHERE length(a.NAME) - length(b.NAME) BETWEEN -10 AND 10;

接下来,分析您的代码并查看热点在哪里。一篇不错的入门文章:使用 Visual Studio Profiler 查找应用程序瓶颈

并查看在 C# 中优化 Levenshtein 算法。

编辑

Chris还注意到,由于levenshtein(a,b) == levenshtein(b,a),您只需要在连接上选择a.ID < b.ID,因为矩阵是对称的。这将使您的问题立即减少一半。

The biggest improvement you can make is to reduce the solution space being considered. Since you want a max distance of 10, any strings that differ by length more than 10 cannot possibly qualify:

SELECT a.NAME AS ANAME, b.NAME AS BNAME, a.ID as AID, b.ID AS BUD 
 FROM myTable a JOIN myTable b ON a.NUM = b.NUM AND a.ID < b.ID
 WHERE length(a.NAME) - length(b.NAME) BETWEEN -10 AND 10;

Next, profile your code and see where are the hot spots. A good entry article: Find Application Bottlenecks with Visual Studio Profiler.

And have a look at Optimizing the Levenshtein Algorithm in C#.

Edit

Also Chris noticed that since levenshtein(a,b) == levenshtein(b,a) you need only select on the join a.ID < b.ID, since the matrix is symmetrical. This will halve your problem right off the bat.

李不 2024-11-26 13:11:44

完成本线程中提到的其他优化后,您可以将 Levenshtein 计算移至服务器,并仅选择与您的编辑距离条件匹配的行。我在项目中需要此功能,因此我用它创建了一个库,此处。 lib中使用的编辑距离方法只需要n * 2内存,而不是n * m。

例如,即使在服务器上,您也只想在字符串长度差 < 时进行 EditDistance 计算。 10,所以先检查一下。
像这样的东西

SELECT a.NAME as NameA, b.NAME as NameB
FROM table a
JOIN table b ON a.NUM = b.NUM
WHERE a.Id < b.Id
AND length(a.NAME) - length(b.NAME) BETWEEN -10 AND 10 OR
    EditDistance(a.Name, b.Name) < 10

After doing the other optimizations mentioned in this thread, you can move the Levenshtein computation to the server and only SELECT the rows that matches your Edit distance criteria. I needed this functionality in a project, so I made a library out of it, here. The edit distance method used in the lib only requires n * 2 memory instead of n * m.

For instance, even when on the server, you only want to do the EditDistance computation when the difference in string length is < 10, so check for that first.
Something like

SELECT a.NAME as NameA, b.NAME as NameB
FROM table a
JOIN table b ON a.NUM = b.NUM
WHERE a.Id < b.Id
AND length(a.NAME) - length(b.NAME) BETWEEN -10 AND 10 OR
    EditDistance(a.Name, b.Name) < 10
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文