python 的 strcmp 或如何在构建后缀数组时有效地对子字符串进行排序(无需复制)
这是从 python 中的字符串构建 后缀数组 的非常简单的方法:
def sort_offsets(a, b):
return cmp(content[a:], content[b:])
content = "foobar baz foo"
suffix_array.sort(cmp=sort_offsets)
print suffix_array
[6, 10, 4, 8, 3, 7, 11, 0, 13, 2, 12, 1, 5, 9]
但是,“content[ a:]" 会复制内容,当内容变大时,这会变得非常低效。所以我想知道是否有一种方法可以比较两个子字符串而不必复制它们。我尝试过使用内置缓冲区,但没有成功。
Here's a very simple way to build an suffix array from a string in python:
def sort_offsets(a, b):
return cmp(content[a:], content[b:])
content = "foobar baz foo"
suffix_array.sort(cmp=sort_offsets)
print suffix_array
[6, 10, 4, 8, 3, 7, 11, 0, 13, 2, 12, 1, 5, 9]
However, "content[a:]" makes a copy of content, which becomes very inefficient when content gets large. So i wonder if there's a way to compare the two substrings without having to copy them. I've tried to use the buffer-builtin, but it didn't worked.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
buffer 函数不会复制整个字符串,而是创建一个对象仅引用源字符串。使用 interjay 的建议,那就是:
The buffer function does not copy the whole string, but creates an object that only references the source string. Using interjay's suggestion, that would be:
我不知道是否有一种快速的方法来比较子字符串,但是您可以通过使用
key
而不是cmp
来使代码更快(更简单):这将创建对于 a 的每个值,只使用一次子字符串。
编辑:一个可能的缺点是子字符串需要 O(n^2) 内存。
I don't know if there's a fast way to compare substrings, but you can make your code much faster (and simpler) by using
key
instead ofcmp
:This will create the substring just once for each value of a.
Edit: A possible downside is that it will require O(n^2) memory for the substrings.
一个非常有趣的问题+1!我看不到任何明显的方法可以直接执行此操作,但通过使用以下比较函数代替您的函数,我能够获得显着的加速(100000 个字符串的数量级):
换句话说,从比较开始每个后缀的前 10 个字符;仅当比较的结果为 0(表明前 10 个字符匹配)时,您才继续比较整个 suffices。
显然 10 可以是任何值:通过实验找到最佳值。
这个比较函数也是一个很好的例子,它不容易被关键函数取代。
+1 for a very interesting problem! I can't see any obvious way to do this directly, but I was able to get a significant speedup (an order of magnitude for 100000 character strings) by using the following comparison function in place of yours:
In other words, start by comparing the first 10 characters of each suffix; only if the result of that comparison is 0, indicating that you've got a match for the first 10 characters, do you go on to compare the entire suffices.
Obviously 10 could be anything: experiment to find the best value.
This comparison function is also a nice example of something that isn't easily replaced with a key function.
您可以使用我编写的 blist 扩展类型。
blist
的工作方式与内置list
类似,但(除其他外)使用写时复制,以便获取切片需要 O(log n) 时间和内存。我能够在 5 秒内从随机生成的 100,000 个字符的字符串创建一个 suffix_array,其中包括生成字符串。
You could use the blist extension type that I wrote. A
blist
works like the built-inlist
, but (among other things) uses copy-on-write so that taking a slice takes O(log n) time and memory.I was able to create a suffix_array from a randomly generated 100,000-character string in under 5 seconds, and that includes generating the string.