How can memoization be applied to this algorithm?
After finding the difflib.SequenceMatcher class in Python's standard library to be unsuitable for my needs, I wrote a generic "diff" module to solve the problem. After several months of thinking more about what it is doing, it appears that the recursive algorithm searches more than it needs to by re-searching the same areas of a sequence that a separate "search thread" may have already examined.
The purpose of the diff module is to compute the differences and similarities between a pair of sequences (list, tuple, string, bytes, bytearray, et cetera). The initial version was much slower than the code's current form, which runs about ten times faster. How can memoization be applied to the following code? What is the best way to rewrite the algorithm to further increase its speed?
class Slice:

    __slots__ = 'prefix', 'root', 'suffix'

    def __init__(self, prefix, root, suffix):
        self.prefix = prefix
        self.root = root
        self.suffix = suffix

################################################################################

class Match:

    __slots__ = 'a', 'b', 'prefix', 'suffix', 'value'

    def __init__(self, a, b, prefix, suffix, value):
        self.a = a
        self.b = b
        self.prefix = prefix
        self.suffix = suffix
        self.value = value

################################################################################

class Tree:

    __slots__ = 'nodes', 'index', 'value'

    def __init__(self, nodes, index, value):
        self.nodes = nodes
        self.index = index
        self.value = value

################################################################################

def search(a, b):
    # Initialize startup variables.
    nodes, index = [], []
    a_size, b_size = len(a), len(b)
    # Begin to slice the sequences.
    for size in range(min(a_size, b_size), 0, -1):
        for a_addr in range(a_size - size + 1):
            # Slice "a" at address and end.
            a_term = a_addr + size
            a_root = a[a_addr:a_term]
            for b_addr in range(b_size - size + 1):
                # Slice "b" at address and end.
                b_term = b_addr + size
                b_root = b[b_addr:b_term]
                # Find out if slices are equal.
                if a_root == b_root:
                    # Create prefix tree to search.
                    a_pref, b_pref = a[:a_addr], b[:b_addr]
                    p_tree = search(a_pref, b_pref)
                    # Create suffix tree to search.
                    a_suff, b_suff = a[a_term:], b[b_term:]
                    s_tree = search(a_suff, b_suff)
                    # Make completed slice objects.
                    a_slic = Slice(a_pref, a_root, a_suff)
                    b_slic = Slice(b_pref, b_root, b_suff)
                    # Finish the match calculation.
                    value = size + p_tree.value + s_tree.value
                    match = Match(a_slic, b_slic, p_tree, s_tree, value)
                    # Append results to tree lists.
                    nodes.append(match)
                    index.append(value)
    # Return largest matches found.
    if nodes:
        return Tree(nodes, index, max(index))
    # Give caller null tree object.
    return Tree(nodes, index, 0)
Reference: How to optimize a recursive algorithm to not repeat itself?
As ~unutbu said, try the memoized decorator and the following changes:
For memoization, dictionaries are best, but they cannot be sliced, so they have to be changed to lists as indicated in the comments above.
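A dict-backed memoize decorator along those lines is straightforward to sketch. The implementation below is illustrative rather than the exact decorator the answer refers to; a classic doubly-recursive function stands in for search to show the effect of the cache:

```python
import functools

def memoized(func):
    # Illustrative memoize decorator: results are stored in a dict
    # keyed on the positional arguments, so every argument must be
    # hashable (tuples and strings work; lists and dicts do not).
    cache = {}
    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoized
def fib(n):
    # Exponential time without the cache, linear with it.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040
```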
You could use the memoize decorator from the Python Decorator Library and use it like this:

The first time you call search with arguments a, b, the result is calculated and memoized (saved in a cache). The second time search is called with the same arguments, the result is returned from the cache.

Note that for the memoized decorator to work, the arguments must be hashable. If a and b are tuples of numbers, then they are hashable. If they are lists, then you could convert them to tuples before passing them to search. It doesn't look like search takes dicts as arguments, but if it did, they would not be hashable and the memoization decorator would not be able to save the result in the cache.
It has been over 9 years since the question was asked, but the concept of internally caching results to speed up the algorithm was finally applied to the code today. The results of this application can be seen below: