strcmp for Python, or how to sort substrings efficiently (without copying) when building a suffix array

Posted 2024-08-22 06:52:47

Here's a very simple way to build a suffix array from a string in Python:

def sort_offsets(a, b):
    return cmp(content[a:], content[b:])

content = "foobar baz foo"
suffix_array = range(len(content))
suffix_array.sort(cmp=sort_offsets)
print suffix_array
[6, 10, 4, 8, 3, 7, 11, 0, 13, 2, 12, 1, 5, 9]

However, "content[a:]" makes a copy of content, which becomes very inefficient when content gets large. So I wonder if there's a way to compare two substrings without having to copy them. I've tried the buffer built-in, but it didn't work.

Comments (4)

趴在窗边数星星i 2024-08-29 06:52:47

The buffer function does not copy the whole string, but creates an object that only references the source string. Using interjay's suggestion, that would be:

suffix_array.sort(key=lambda a: buffer(content, a))
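
For completeness, a full run of this suggestion on the question's example. Note this is Python 2 only (buffer was removed in Python 3); buffer objects compare by the bytes they reference, which is what makes the sort come out right:

content = "foobar baz foo"
suffix_array = range(len(content))
# Each key is a lightweight view starting at offset a; no suffix is copied.
suffix_array.sort(key=lambda a: buffer(content, a))
print suffix_array
[6, 10, 4, 8, 3, 7, 11, 0, 13, 2, 12, 1, 5, 9]
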
夜深人未静 2024-08-29 06:52:47

I don't know if there's a fast way to compare substrings, but you can make your code much faster (and simpler) by using key instead of cmp:

suffix_array.sort(key=lambda a: content[a:])

This will create the substring just once for each value of a.

Edit: A possible downside is that it will require O(n^2) memory for the substrings.
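
To put a number on that memory note: with key, every slice content[a:] is built up front, so the keys hold n + (n-1) + ... + 1 = n(n+1)/2 characters in total. A quick way to see the count (my arithmetic, not part of the answer):

n = len(content)
total_key_chars = sum(n - a for a in range(n))   # equals n * (n + 1) / 2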

凤舞天涯 2024-08-29 06:52:47

+1 for a very interesting problem! I can't see any obvious way to do this directly, but I was able to get a significant speedup (an order of magnitude for 100,000-character strings) by using the following comparison function in place of yours:

def compare_offsets2(a, b):
    return (cmp(content[a:a+10], content[b:b+10]) or
            cmp(content[a:], content[b:]))

In other words, start by comparing the first 10 characters of each suffix; only if the result of that comparison is 0, indicating that you've got a match for the first 10 characters, do you go on to compare the entire suffixes.

Obviously 10 could be anything: experiment to find the best value.

This comparison function is also a nice example of something that isn't easily replaced with a key function.
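
A small generalization of the same trick, with the prefix length as a parameter so it can be tuned as suggested; the helper name here is mine, not from the answer:

def make_compare_offsets(content, k):
    def compare(a, b):
        # Cheap comparison on a k-character prefix first; only on a tie
        # fall back to the full (copying) comparison.
        return (cmp(content[a:a+k], content[b:b+k]) or
                cmp(content[a:], content[b:]))
    return compare

suffix_array.sort(cmp=make_compare_offsets(content, 10))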

野生奥特曼 2024-08-29 06:52:47

You could use the blist extension type that I wrote. A blist works like the built-in list, but (among other things) uses copy-on-write so that taking a slice takes O(log n) time and memory.

from blist import blist

content = "foobar baz foo"
content = blist(content)
suffix_array = range(len(content))
suffix_array.sort(key = lambda a: content[a:])
print suffix_array
[6, 10, 4, 8, 3, 7, 11, 0, 13, 2, 12, 1, 5, 9]

I was able to create a suffix_array from a randomly generated 100,000-character string in under 5 seconds, and that includes generating the string.
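
A rough way to reproduce that timing claim (assuming the blist package is installed; absolute numbers will of course depend on the machine):

import random
import string
import time
from blist import blist

# Build a random 100,000-character string, wrapped in a blist as above.
content = blist(''.join(random.choice(string.lowercase) for _ in xrange(100000)))
start = time.time()
suffix_array = range(len(content))
suffix_array.sort(key=lambda a: content[a:])
print "sorted %d suffixes in %.1f seconds" % (len(suffix_array), time.time() - start)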
