Speeding up a Cython Damerau-Levenshtein implementation

Posted 2024-10-31 03:37:47

I have the following Cython implementation that calculates the Damerau–Levenshtein distance between two strings, based on this Wikipedia article, but it is currently too slow for my needs. I have a list of about 600,000 strings and I have to find typos in that list.

I would be glad if anyone could suggest algorithmic improvements or some Python/Cython magic that could reduce the runtime of the script. I don't really care how much space it uses, only how long the calculation takes.

Profiling the script on about 2,000 strings shows that it spends 80% of the total runtime (24 of 30 seconds) in the damerauLevenshteinDistance function, and I'm out of ideas for making it faster.

def damerauLevenshteinDistance(a, b, h):
    """
    a = source sequence
    b = comparing sequence
    h = matrix to store the metrics (currently nested list)
    """
    cdef int inf,lena,lenb,i,j,x,i1,j1,d,db
    alphabet = getAlphabet((a,b))
    lena = len(a)
    lenb = len(b)
    inf = lena + lenb + 1
    da = [0 for x in xrange(0, len(alphabet))]
    for i in xrange(1, lena+1):
        db = 0
        for j in xrange(1, lenb+1):
            i1 = da[alphabet[b[j-1]]]
            j1 = db
            d = 1
            if (a[i-1] == b[j-1]):
                d = 0
                db = j
            h[i+1][j+1] = min(
                h[i][j]+d,
                h[i+1][j]+1,
                h[i][j+1]+1,
                h[i1][j1]+(i-i1-1)+1+(j-j1-1)
            )
        da[alphabet[a[i-1]]] = i
    return h[lena+1][lenb+1]

cdef getAlphabet(words):
    """
    construct an alphabet out of the sequences found in the tuple words, with a
    sequential identifier for each letter
    """
    cdef int i
    alphabet = {}
    i = 0
    for wordList in words:
        for letter in wordList:
            if letter not in alphabet:
                alphabet[letter] = i
                i += 1
    return alphabet


爱要勇敢去追 2024-11-07 03:37:49

I recently open-sourced a Cython implementation of the Damerau-Levenshtein algorithm. It includes both the pyx and the C source.

https://github.com/gfairchild/pyxDamerauLevenshtein


等待我真够勒 2024-11-07 03:37:48

At least for longer strings, you should get better performance from a different algorithm that doesn't have to calculate every value in the lena⋅lenb matrix. For example, it is often unnecessary to calculate the exact cost of the [lena][0] corner of the matrix, which represents the cost of starting by deleting all characters in a.

A better algorithm might be to always look at the point with the lowest weight calculated so far, and then go one step further in all directions from there. This way you might reach the target location without examining all locations in the matrix.

An implementation of this algorithm could use a priority queue and would look like this:

from heapq import heappop, heappush

def distance(a, b):
   pq = [(0,0,0)]
   lena = len(a)
   lenb = len(b)
   while True:
      (wgh, i, j) = heappop(pq)
      if i == lena and j == lenb:
         return wgh
      if i < lena:
         # deleted
         heappush(pq, (wgh+1, i+1, j))
      if j < lenb:
         # inserted
         heappush(pq, (wgh+1, i, j+1))
      if i < lena and j < lenb:
         if a[i] == b[j]:
            # unchanged
            heappush(pq, (wgh, i+1, j+1))
         else:
            # changed
            heappush(pq, (wgh+1, i+1, j+1))
      # ... more possibilities for changes, like your "+(i-i1-1)+1+(j-j1-1)"

This is just a rough implementation; it could be improved a lot.

When adding new coordinates to the queue, check:
- If the coordinates have already been processed, don't add them again.
- If the coordinates are currently in the queue, keep only the instance with the lower weight.

Also consider using a priority queue implemented in C instead of the heapq module.
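
The two queue checks listed above could be sketched roughly as follows. This is only an illustration under some assumptions: it covers plain Levenshtein moves (no transposition), the `done` set is a name introduced here, and "keep only the better instance" is approximated by lazy deletion, where stale queue entries are skipped when popped instead of being removed from the heap.

```python
from heapq import heappop, heappush


def distance(a, b):
    """Dijkstra-style search over the edit graph (plain Levenshtein costs only)."""
    lena, lenb = len(a), len(b)
    pq = [(0, 0, 0)]   # entries are (weight, i, j)
    done = set()       # coordinates already popped with their final weight
    while pq:
        wgh, i, j = heappop(pq)
        if (i, j) in done:
            continue   # stale entry: a cheaper copy was processed earlier
        done.add((i, j))
        if i == lena and j == lenb:
            return wgh
        if i < lena and (i + 1, j) not in done:
            heappush(pq, (wgh + 1, i + 1, j))          # deletion
        if j < lenb and (i, j + 1) not in done:
            heappush(pq, (wgh + 1, i, j + 1))          # insertion
        if i < lena and j < lenb and (i + 1, j + 1) not in done:
            cost = 0 if a[i] == b[j] else 1            # match or substitution
            heappush(pq, (wgh + cost, i + 1, j + 1))
```

Because every edge weight is non-negative, the first time a coordinate is popped its weight is final, which is what makes the `done` check safe. Whether this beats filling the whole matrix for short words would need to be measured.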


妄司 2024-11-07 03:37:48

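It looks like you could statically type more of the code than you currently do, which should improve speed.

You can also look at this Cython implementation of Levenshtein distance as an example:
http://hackmap.blogspot.com/2008/04/levenshtein-in-cython.html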


可是我不能没有你 2024-11-07 03:37:48

My guess is that the biggest improvement to your current code would come from using a C array instead of a list of lists for the h matrix.
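
As a rough illustration of the layout change, here is a pure-Python sketch of the question's recurrence with h stored as one flat, row-major buffer instead of nested lists. The function name `damerau_levenshtein` and the helper variables `width`, `row` and `prev` are introduced here for illustration, and the border initialization follows the usual Wikipedia formulation, since the question does not show how h is set up. In Cython the same indexing would apply to a typed C array or memoryview; this sketch only shows the indexing, not the typing.

```python
def damerau_levenshtein(a, b):
    """Unrestricted Damerau-Levenshtein distance using a flat row-major buffer."""
    lena, lenb = len(a), len(b)
    inf = lena + lenb
    width = lenb + 2
    h = [0] * ((lena + 2) * width)       # h[i][j] lives at h[i * width + j]
    h[0] = inf
    for i in range(lena + 1):
        h[(i + 1) * width] = inf         # sentinel column
        h[(i + 1) * width + 1] = i       # cost of deleting i characters
    for j in range(lenb + 1):
        h[j + 1] = inf                   # sentinel row
        h[width + j + 1] = j             # cost of inserting j characters
    da = {}                              # last row in which each letter of a was seen
    for i in range(1, lena + 1):
        db = 0                           # last column in this row where the letters matched
        row = (i + 1) * width
        prev = i * width
        for j in range(1, lenb + 1):
            i1 = da.get(b[j - 1], 0)
            j1 = db
            d = 1
            if a[i - 1] == b[j - 1]:
                d = 0
                db = j
            h[row + j + 1] = min(
                h[prev + j] + d,                                       # match / substitution
                h[row + j] + 1,                                        # insertion
                h[prev + j + 1] + 1,                                   # deletion
                h[i1 * width + j1] + (i - i1 - 1) + 1 + (j - j1 - 1),  # transposition
            )
        da[a[i - 1]] = i
    return h[(lena + 1) * width + lenb + 1]
```

Using a dict for `da` also removes the need to build a separate alphabet mapping for every pair of strings.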


屋檐 2024-11-07 03:37:48

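Run it through "cython -a", which gives you an HTML-annotated version of the source with lines highlighted in yellow. The darker the color, the more Python operations take place on that line. This usually helps a lot in finding time-consuming object conversions and the like.

However, I'm pretty sure the biggest problem is your data structure. Consider using NumPy arrays instead of nested lists, or just a dynamically allocated block of C memory.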


橘味果▽酱 2024-11-07 03:37:48

If the same words come up repeatedly in your search (that is, if you need to calculate the Damerau-Levenshtein distance several times for the same pair of input strings), consider using a dictionary (or hash map) to cache the results. Here is an implementation in C#:

    private static Dictionary<int, Dictionary<int, int>> DamerauLevenshteinDictionary = new Dictionary<int, Dictionary<int, int>>();

    public static int DamerauLevenshteinDistanceWithDictionaryCaching(string word1, string word2)
    {
        Dictionary<int, int> word1Dictionary;

        if (DamerauLevenshteinDictionary.TryGetValue(word1.GetHashCode(), out word1Dictionary))
        {
            int distance;

            if (word1Dictionary.TryGetValue(word2.GetHashCode(), out distance))
            {
                // The distance is already in the dictionary
                return distance;
            }
            else
            {
                // The word1 has been found in the dictionary, but the matching with word2 hasn't been found.
                distance = DamerauLevenshteinDistance(word1, word2);
                DamerauLevenshteinDictionary[word1.GetHashCode()].Add(word2.GetHashCode(), distance);
                return distance;
            }
        }
        else
        {
            // The word1 hasn't been found in the dictionary, we must add an entry to the dictionary with that match.
            int distance = DamerauLevenshteinDistance(word1, word2);
            Dictionary<int, int> dictionaryToAdd = new Dictionary<int,int>();
            dictionaryToAdd.Add(word2.GetHashCode(), distance);
            DamerauLevenshteinDictionary.Add(word1.GetHashCode(), dictionaryToAdd);
            return distance;
        }
    }
