通过位屏蔽查找数据间隙

发布于 2024-10-06 02:23:17 字数 1547 浏览 1 评论 0原文

我面临着在数字序列中查找给定长度的不连续性(间隙)的问题。因此,例如,给定 [1,2,3,7,8,9,10]length=3 的间隙,我会找到 [4,5,6]。如果间隙是 length=4,我什么也找不到。当然,真正的序列要长得多。我在很多帖子中看到过这个问题,它有各种应用程序和可能的实现。

我认为可能有效并且应该相对较快的一种方法是将完整的集合表示为一个位数组,其中 1 表示可用数字,0 表示缺失 - 所以上面的代码看起来像 [1,1,1,0,0 ,0,1,1,1,1]。然后可能运行一个窗口函数,该函数将给定长度的数组与完整的集合进行异或屏蔽,直到所有位置的结果均为 1。这将需要在大约 ~O(n) 的时间内对整个序列进行一次传递,再加上每次运行中的掩蔽。

这是我设法想出的结果:

def find_gap(array, start=0, length=10):
    """
    array:  assumed to be of length MAX_NUMBER and contain 0 or 1 
            if the value is actually present
    start:  indicates what value to start looking from
    length: what the length the gap should be
    """

    # create the bitmask to check against
    mask = ''.join( [1] * length )

    # convert the input 0/1 mapping to bit string
    # e.g - [1,0,1,0] -> '1010'
    bits =''.join( [ str(val) for val in array ] )

    for i in xrange(start, len(bits) - length):

        # find where the next gap begins
        if bits[i] != '0': continue

        # gap was found, extract segment of size 'length', compare w/ mask
        if (i + length < len(bits)):
            segment = bits[i:i+length]

            # use XOR between binary masks
            result  = bin( int(mask, 2) ^ int(segment, 2) )

            # if mask == result in base 2, gap found
            if result == ("0b%s" % mask): return i

    # if we got here, no gap exists
    return -1

这对于〜100k(<1秒)来说相当快。我希望得到有关如何使更大集合更快/更高效的提示。谢谢!

I'm faced with a problem of finding discontinuities (gaps) of a given length in a sequence of numbers. So, for example, given [1,2,3,7,8,9,10] and a gap of length=3, I'll find [4,5,6]. If the gap is length=4, I'll find nothing. The real sequence is, of course, much longer. I've seen this problem in quite a few posts, and it had various applications and possible implementations.

One way I thought might work and should be relatively quick is to represent the complete set as a bit array containing 1 for available number and 0 for missing - so the above will look like [1,1,1,0,0,0,1,1,1,1]. Then possibly run a window function that'll XOR mask an array of the given length with the complete set until all locations result in 1. This will require a single pass over the whole sequence in roughly ~O(n), plus the cost of masking in each run.

Here's what I managed to come up with:

def find_gap(array, start=0, length=10):
    """
    array:  assumed to be of length MAX_NUMBER and contain 0 or 1 
            if the value is actually present
    start:  indicates what value to start looking from
    length: what the length the gap should be
    """

    # create the bitmask to check against
    mask = ''.join( [1] * length )

    # convert the input 0/1 mapping to bit string
    # e.g - [1,0,1,0] -> '1010'
    bits =''.join( [ str(val) for val in array ] )

    for i in xrange(start, len(bits) - length):

        # find where the next gap begins
        if bits[i] != '0': continue

        # gap was found, extract segment of size 'length', compare w/ mask
        if (i + length < len(bits)):
            segment = bits[i:i+length]

            # use XOR between binary masks
            result  = bin( int(mask, 2) ^ int(segment, 2) )

            # if mask == result in base 2, gap found
            if result == ("0b%s" % mask): return i

    # if we got here, no gap exists
    return -1

This is fairly quick for ~100k (< 1 sec). I'd appreciate tips on how to make this faster / more efficient for larger sets. thanks!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

醉梦枕江山 2024-10-13 02:23:17

找出相邻数字之间的差异,然后寻找足够大的差异。我们通过构造两个列表(除了第一个之外的所有数字,以及除了最后一个之外的所有数字)并将它们两两相减来找到差异。我们可以使用 zip 将值配对。

def find_gaps(numbers, gap_size):
    adjacent_differences = [(y - x) for (x, y) in zip(numbers[:-1], numbers[1:])]
    # If adjacent_differences[i] > gap_size, there is a gap of that size between
    # numbers[i] and numbers[i+1]. We return all such indexes in a list - so if
    # the result is [] (empty list), there are no gaps.
    return [i for (i, x) in enumerate(adjacent_differences) if x > gap_size]

(另外,学习一些 Python 习惯用法。我们更喜欢直接迭代,并且我们有真正的布尔类型。)

Find the differences between adjacent numbers, and then look for a difference that's large enough. We find the differences by constructing two lists - all the numbers but the first, and all the numbers but the last - and subtracting them pairwise. We can use zip to pair the values up.

def find_gaps(numbers, gap_size):
    adjacent_differences = [(y - x) for (x, y) in zip(numbers[:-1], numbers[1:])]
    # If adjacent_differences[i] > gap_size, there is a gap of that size between
    # numbers[i] and numbers[i+1]. We return all such indexes in a list - so if
    # the result is [] (empty list), there are no gaps.
    return [i for (i, x) in enumerate(adjacent_differences) if x > gap_size]

(Also, please learn some Python idioms. We prefer direct iteration, and we have a real boolean type.)

苄①跕圉湢 2024-10-13 02:23:17

您可以使用 XOR 和移位,它确实会在大约 O(n) 时间内运行。

然而,在实践中,构建索引(大于某个最小长度的所有间隙的哈希列表)可能是更好的方法。

假设您从这些整数的序列(而不是位掩码)开始,然后只需遍历该序列即可构建索引;每当您发现间隙大于阈值时,您都会将该间隙大小添加到字典中(如有必要,将其实例化为空列表,然后在序列中附加偏移量。

最后,您将获得每个间隙的列表(大于 。

这种方法的一个好处是,您应该能够在修改基本列表时维护该索引,因此可以摊销构建索引所花费的 O(n*log(n)) 初始时间 后续查询和更新索引的成本为 O(log(n))

这是构建 gap_index() 的一个非常粗略的函数:

def gap_idx(s, thresh=2):
    ret = dict()
    lw = s[0]  # initial low val.
    for z,i in enumerate(s[1:]):
        if i - lw < thresh:
            lw = i
            continue
        key = i - lw
        if key not in ret:
            ret[key] = list()
        ret[key].append(z)
        lw = i
    return ret

一个同时维护数据集和索引的类可能是最好的。围绕内置的“bisect”模块及其 insort() 函数构建。

You could use XOR and shift and it does run in roughly O(n) time.

However, in practice, building an index (hash list of all gaps greater then some minimum length) might be a better approach.

Assuming that you start with a sequence of these integers (rather than a bitmask) then you build an index by simply walking over the sequence; any time you find a gap greater than your threshold you add that gap size to your dictionary (instantiate it as an empty list if necessary, and then append the offset in the sequence.

At the end you have a list of every gap (greater than your desired threshold) in your sequence.

One nice thing about this approach is that you should be able to maintain this index as you modify the base list. So the O(n*log(n)) initial time spent building the index is amortized by O(log(n)) cost for subsequent queries and updates to the indexes.

Here's a very crude function to build the gap_index():

def gap_idx(s, thresh=2):
    ret = dict()
    lw = s[0]  # initial low val.
    for z,i in enumerate(s[1:]):
        if i - lw < thresh:
            lw = i
            continue
        key = i - lw
        if key not in ret:
            ret[key] = list()
        ret[key].append(z)
        lw = i
    return ret

A class to maintain both a data set and the index might best be built around the built-in 'bisect' module and its insort() function.

温柔戏命师 2024-10-13 02:23:17

几乎是aix所做的......但只得到所需长度的间隙:

def findGaps(mylist, gap_length, start_idx=0):
    gap_starts = []
    for idx in range(start_idx, len(mylist) - 1):
        if mylist[idx+1] - mylist[idx] == gap_length + 1:
            gap_starts.append(mylist[idx] + 1)

    return gap_starts

编辑:根据OP的意愿进行调整。

Pretty much what aix did ... but getting only the gaps of the desired length:

def findGaps(mylist, gap_length, start_idx=0):
    gap_starts = []
    for idx in range(start_idx, len(mylist) - 1):
        if mylist[idx+1] - mylist[idx] == gap_length + 1:
            gap_starts.append(mylist[idx] + 1)

    return gap_starts

EDIT: Adjusted to the OP's wishes.

酒绊 2024-10-13 02:23:17

这些提供了输入列表的单步走。

给定长度的间隙值列表:

from itertools import tee, izip
def gapsofsize(iterable, length):
    a, b = tee(iterable)
    next(b, None)
    return ( p for x, y in izip(a, b) if y-x == length+1 for p in xrange(x+1,y) )

print list(gapsofsize([1,2,5,8,9], 2))

[3, 4, 6, 7]

所有间隙值:

def gaps(iterable):
    a, b = tee(iterable)
    next(b, None)
    return ( p for x, y in izip(a, b) if y-x > 1 for p in xrange(x+1,y) )

print list(gaps([1,2,4,5,8,9,14]))

[3, 6, 7, 10, 11, 12, 13]

作为向量的间隙列表:

def gapsizes(iterable):
    a, b = tee(iterable)
    next(b, None)
    return ( (x+1, y-x-1) for x, y in izip(a, b) if y-x > 1 )

print list(gapsizes([1,2,4,5,8,9,14]))

[(3, 1), (6, 2), (10, 4)]

请注意,这些是生成器,消耗很少的内存。
我很想知道它们在您的测试数据集上的表现如何。

These provide a single walk of your input list.

List of gap values for given length:

from itertools import tee, izip
def gapsofsize(iterable, length):
    a, b = tee(iterable)
    next(b, None)
    return ( p for x, y in izip(a, b) if y-x == length+1 for p in xrange(x+1,y) )

print list(gapsofsize([1,2,5,8,9], 2))

[3, 4, 6, 7]

All gap values:

def gaps(iterable):
    a, b = tee(iterable)
    next(b, None)
    return ( p for x, y in izip(a, b) if y-x > 1 for p in xrange(x+1,y) )

print list(gaps([1,2,4,5,8,9,14]))

[3, 6, 7, 10, 11, 12, 13]

List of gaps as vectors:

def gapsizes(iterable):
    a, b = tee(iterable)
    next(b, None)
    return ( (x+1, y-x-1) for x, y in izip(a, b) if y-x > 1 )

print list(gapsizes([1,2,4,5,8,9,14]))

[(3, 1), (6, 2), (10, 4)]

Note that these are generators and consume very little memory.
I would love to know how these perform on your test dataset.

非要怀念 2024-10-13 02:23:17

如果您追求的是效率,我会按照以下方式执行操作(其中 x 是序列号列表):

for i in range(1, len(x)):
  if x[i] - x[i - 1] == length + 1:
    print list(range(x[i - 1] + 1, x[i]))

If it's efficiency you're after, I'd do something along the following lines (where x is the list of sequence numbers):

for i in range(1, len(x)):
  if x[i] - x[i - 1] == length + 1:
    print list(range(x[i - 1] + 1, x[i]))
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文