Cython Damerau-Levenshtein 加速
我有以下 cython 实现,基于 计算 2 个字符串的 Damerau–Levenshtein 距离这篇维基百科文章,但目前它对于我的需求来说太慢了。我有一个大约 600000 个字符串的列表,我必须在该列表中找到拼写错误。
如果有人能建议任何算法改进或一些可以减少脚本运行时间的 python/cython 魔法,我会很高兴。我并不关心它使用了多少空间,只关心计算所需的时间。
根据使用大约 2000 个字符串对脚本进行分析,它在 damerauLevenshteinDistance
函数中花费了整个运行时间的 80%(30 秒中的 24 秒),而我完全不知道如何使其更快。
def damerauLevenshteinDistance(a, b, h):
"""
a = source sequence
b = comparing sequence
h = matrix to store the metrics (currently nested list)
"""
cdef int inf,lena,lenb,i,j,x,i1,j1,d,db
alphabet = getAlphabet((a,b))
lena = len(a)
lenb = len(b)
inf = lena + lenb + 1
da = [0 for x in xrange(0, len(alphabet))]
for i in xrange(1, lena+1):
db = 0
for j in xrange(1, lenb+1):
i1 = da[alphabet[b[j-1]]]
j1 = db
d = 1
if (a[i-1] == b[j-1]):
d = 0
db = j
h[i+1][j+1] = min(
h[i][j]+d,
h[i+1][j]+1,
h[i][j+1]+1,
h[i1][j1]+(i-i1-1)+1+(j-j1-1)
)
da[alphabet[a[i-1]]] = i
return h[lena+1][lenb+1]
cdef getAlphabet(words):
"""
construct an alphabet out of the lists found in the tuple words with a
sequential identifier for each word
"""
cdef int i
alphabet = {}
i = 0
for wordList in words:
for letter in wordList:
if letter not in alphabet:
alphabet[letter] = i
i += 1
return alphabet
I have the following cython implementation of calculating the Damerau–Levenshtein distance of 2 strings, based on this Wikipedia article, but currently it is too slow for my needs. I have a list of about 600000 strings and I have to find typos in that list.
I would be glad if anyone could suggest any algorithmic improvements or some python/cython magic that could reduce the runtime of the script. I don't really care about how much space it uses only the time it takes to calculate.
According to profiling the script using about 2000 strings it spends 80% of the complete runtime (24 of 30 sec) in the damerauLevenshteinDistance
function, and I'm all out of ideas how to make it faster.
def damerauLevenshteinDistance(a, b, h):
"""
a = source sequence
b = comparing sequence
h = matrix to store the metrics (currently nested list)
"""
cdef int inf,lena,lenb,i,j,x,i1,j1,d,db
alphabet = getAlphabet((a,b))
lena = len(a)
lenb = len(b)
inf = lena + lenb + 1
da = [0 for x in xrange(0, len(alphabet))]
for i in xrange(1, lena+1):
db = 0
for j in xrange(1, lenb+1):
i1 = da[alphabet[b[j-1]]]
j1 = db
d = 1
if (a[i-1] == b[j-1]):
d = 0
db = j
h[i+1][j+1] = min(
h[i][j]+d,
h[i+1][j]+1,
h[i][j+1]+1,
h[i1][j1]+(i-i1-1)+1+(j-j1-1)
)
da[alphabet[a[i-1]]] = i
return h[lena+1][lenb+1]
cdef getAlphabet(words):
"""
construct an alphabet out of the lists found in the tuple words with a
sequential identifier for each word
"""
cdef int i
alphabet = {}
i = 0
for wordList in words:
for letter in wordList:
if letter not in alphabet:
alphabet[letter] = i
i += 1
return alphabet
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
我最近刚刚开源了 Damerau-Levenshtein 算法的 Cython 实现。我包括 pyx 和 C 源代码。
https://github.com/gfairchild/pyxDamerauLevenshtein
I just recently open-sourced a Cython implementation of the Damerau-Levenshtein algorithm. I include both the pyx and C source.
https://github.com/gfairchild/pyxDamerauLevenshtein
至少对于较长的字符串,您应该通过使用不同的算法来获得更好的性能,该算法不必计算 lena⋅lenb 矩阵中的所有值。例如,通常可能不需要计算矩阵的
[lena][0]
角的确切成本,它表示从删除a.
更好的算法可能是始终查看迄今为止计算出的权重最低的点,然后从那里向各个方向更进一步。这样您就可以到达目标位置,而无需检查矩阵中的所有位置:
该算法的实现可以使用优先级队列,如下所示:
这只是一个粗略的实现,可以进行很大改进:
heapq模块
At least for longer strings you should get better performance by using a different algorithm that doesn't have to calculate all the values in the lena⋅lenb Matrix. For example it might often not be necessary to calculate the exact cost of the
[lena][0]
corner of the matrix, which represents the cost of starting by deleting all characters ina
.A better algorithm might be to always look at the point with the lowest weight calculated so far, and then go one step further in all directions from there. This way you might reach the target location without examining all locations in the matrix:
An implementation of this algorithm could use a priority queue and would look like this:
This is just a rough implementation, it could be improved a lot:
heapq
module看起来您可以静态地输入比当前更多的代码,这会提高速度。
您还可以查看 Cython 中 Levenshtein Distance 的实现作为示例:
http://hackmap.blogspot.com/2008/04/levenshtein-in -cython.html
It seems like you could statically type more of your code than you currently are, which would increase the speed.
You might also check out an implementation of the Levenshtein Distance in Cython as an example:
http://hackmap.blogspot.com/2008/04/levenshtein-in-cython.html
我的猜测是,当前代码中最大的改进将来自于使用 C 数组而不是
h
矩阵的列表列表。My guess would be that the biggest improvement in your current code would come from using a C array instead of a list of lists for the
h
matrix.通过“cython -a”运行它,这将为您提供带有漂亮黄色注释行的 HTML 注释源版本。颜色越深,该行中发生的 Python 操作越多。这通常有助于查找耗时的对象转换等。
然而,我很确定最大的问题是你的数据结构。考虑使用 NumPy 数组而不是嵌套列表,或者仅使用动态分配的 C 内存块。
Run it through "cython -a", that will give you an HTML annotated source version with nicely yellow annotated lines. The darker the colour, the more Python operations are happening in that line. That usually helps quite a bit in finding time consuming object conversions and the like.
However, I'm pretty sure it will turn out that the biggest problem is your data structure. Consider using NumPy arrays instead of nested lists, or just use a dynamically allocated C memory block.
如果您的搜索中返回了多个单词(如果您需要对输入字符串的相同值多次计算 Damerau Levenshtein Distance),您可以考虑使用字典(或哈希图)来缓存结果。下面是 C# 中的实现:
If several words comes back in your search (if you need to calculate the Damerau Levenshtein Distance several times for the same value of the input strings), you can consider using a Dictionary (or hashmap) to cache your results. Here is an implementation in C#: