使用缓冲区的最长公共前缀？

发布于 2024-12-14 19:15:25 字数 743 浏览 0 评论 0原文

如果我有一个输入字符串和一个数组：

s = "to_be_or_not_to_be" 
pos = [15, 2, 8]

我试图找到引用原始 s 的数组 pos 的连续元素之间的最长公共前缀。我试图获得以下输出：

longest = [3,1]

我获得此结果的方式是通过计算以下对的最长公共前缀：

s[15:]，即_be和< code>s[2:] 是 _be_or_not_to_be 给出 3 ( _be )
s[2:] 是_be_or_not_to_be 和 s[8:] ，其中 _not_to_be 给出 1 ( _ )

但是，如果 s 很大，当我执行诸如 s[x:] 之类的操作时，我不想创建多个副本。经过几个小时的搜索，我发现函数 buffer 只维护输入字符串的一份副本，但我没有不确定在这种情况下使用它的最有效方法是什么。关于如何实现这一目标有什么建议吗？

原文

If I have an input string and an array:

s = "to_be_or_not_to_be" 
pos = [15, 2, 8]

I am trying to find the longest common prefix between the consecutive elements of the array pos referencing the original s. I am trying to get the following output:

longest = [3,1]

The way I obtained this is by computing the longest common prefix of the following pairs:

s[15:] which is _be and s[2:] which is _be_or_not_to_be giving 3 ( _be )
s[2:] which is _be_or_not_to_be and s[8:] which is _not_to_be giving 1 ( _ )

However, if s is huge, I don't want to create multiple copies when I do something like s[x:]. After hours of searching, I found the function buffer that maintains only one copy of the input string but I wasn't sure what is the most efficient way to utilize it here in this context. Any suggestions on how to achieve this?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

偷得浮生 2024-12-21 19:15:25

这是一种没有 buffer 的方法，它不会复制，因为它一次只查看一个字符：

from itertools import islice, izip

s = "to_be_or_not_to_be"
pos = [15, 2, 8]


length = len(s)    

for start1, start2 in izip(pos, islice(pos, 1, None)):
    pref = 0
    for pos1, pos2 in izip(xrange(start1, length), xrange(start2, length)):
        if s[pos1] == s[pos2]:
            pref += 1
        else:
            break
    print pref
# prints 3 1

我使用 islice、izip，和 xrange 如果您正在谈论可能非常长的字符串。

我也无法抗拒这个“One Liner”，它甚至不需要任何索引：

[next((i for i, (a, b) in 
    enumerate(izip(islice(s, start1, None), islice(s, start2, None))) 
        if a != b), 
    length - max((start1, start2))) 
 for start1, start2 in izip(pos, islice(pos, 1, None))]

最后一种方法，使用 os.path.commonprefix：

[len(commonprefix((buffer(s, n), buffer(s, m)))) for n, m in zip(pos, pos[1:])]

Here is a method without buffer which doesn't copy, as it only looks at one character at a time:

from itertools import islice, izip

s = "to_be_or_not_to_be"
pos = [15, 2, 8]


length = len(s)    

for start1, start2 in izip(pos, islice(pos, 1, None)):
    pref = 0
    for pos1, pos2 in izip(xrange(start1, length), xrange(start2, length)):
        if s[pos1] == s[pos2]:
            pref += 1
        else:
            break
    print pref
# prints 3 1

I use islice, izip, and xrange in case you're talking about potentially very long strings.

I also couldn't resist this "One Liner" which doesn't even require any indexing:

[next((i for i, (a, b) in 
    enumerate(izip(islice(s, start1, None), islice(s, start2, None))) 
        if a != b), 
    length - max((start1, start2))) 
 for start1, start2 in izip(pos, islice(pos, 1, None))]

One final method, using os.path.commonprefix:

[len(commonprefix((buffer(s, n), buffer(s, m)))) for n, m in zip(pos, pos[1:])]

回复收藏 0 原文

戒ㄋ 2024-12-21 19:15:25

>>> import os
>>> os.path.commonprefix([s[i:] for i in pos])
'_'

让 Python 为你管理内存。不要过早优化。

要获得确切的输出，您可以执行以下操作（如 @agf 建议） :

print [len(commonprefix([buffer(s, i) for i in adj_indexes]))
       for adj_indexes in zip(pos, pos[1:])]
# -> [3, 1]

>>> import os
>>> os.path.commonprefix([s[i:] for i in pos])
'_'

Let Python to manage memory for you. Don't optimize prematurely.

To get the exact output you could do (as @agf suggested):

print [len(commonprefix([buffer(s, i) for i in adj_indexes]))
       for adj_indexes in zip(pos, pos[1:])]
# -> [3, 1]

回复收藏 0 原文

ぃ弥猫深巷。 2024-12-21 19:15:25

我认为你对副本的担心是没有根据的。请参见下文：

>>> s = "how long is a piece of string...?"
>>> t = s[12:]
>>> print t
a piece of string...?
>>> id(t[0])
23295440
>>> id(s[12])
23295440
>>> id(t[2:20]) == id(s[14:32])
True

除非您复制切片并留下对副本的引用，否则我认为这不会导致任何问题。

编辑：有一些关于字符串实习的技术细节和我自己不太清楚的东西。但我确信字符串切片并不总是副本：

>>> x = 'google.com'
>>> y = x[:]
>>> x is y
True

我想我想要给出的答案是让python自己管理它的内存，首先，你可以看看内存缓冲区和视图稍后如果需要的话。如果这已经是您遇到的真正问题，请用实际问题的详细信息更新您的问题。

I think your worrying about copies is unfounded. See below:

>>> s = "how long is a piece of string...?"
>>> t = s[12:]
>>> print t
a piece of string...?
>>> id(t[0])
23295440
>>> id(s[12])
23295440
>>> id(t[2:20]) == id(s[14:32])
True

Unless you're copying the slices and leaving references to the copies hanging around, I wouldn't think it could cause any problem.

edit: There are technical details with string interning and stuff that I'm not really clear on myself. But I'm sure that a string slice is not always a copy:

>>> x = 'google.com'
>>> y = x[:]
>>> x is y
True

I guess the answer I'm trying to give is to just let python manage its memory itself, to begin with, you can look at memory buffers and views later if needed. And if this is already a real problem occurring for you, update your question with details of what the actual problem is.

回复收藏 0 原文

痴梦一场 2024-12-21 19:15:25

下面给出了使用缓冲区的一种方法。然而，可能还有更快的方法。

s = "to_be_or_not_to_be" 
pos = [15, 2, 8]

lcp = []
length = len(pos) - 1

for index in range(0, length):
    pre = buffer(s, pos[index])
    cur = buffer(s, pos[index+1], pos[index+1]+len(pre))

    count = 0

    shorter, longer = min(pre, cur), max(pre, cur)

    for i, c in enumerate(shorter):
        if c != longer[i]:
            break
        else:
            count += 1

    lcp.append(count)
    print 

print lcp

One way of doing using buffer this is give below. However, there could be much faster ways.

s = "to_be_or_not_to_be" 
pos = [15, 2, 8]

lcp = []
length = len(pos) - 1

for index in range(0, length):
    pre = buffer(s, pos[index])
    cur = buffer(s, pos[index+1], pos[index+1]+len(pre))

    count = 0

    shorter, longer = min(pre, cur), max(pre, cur)

    for i, c in enumerate(shorter):
        if c != longer[i]:
            break
        else:
            count += 1

    lcp.append(count)
    print 

print lcp

回复收藏 0 原文

~没有更多了~