连接一组有序整数生成 Python 迭代器

发布于 2024-07-22 21:25:57 字数 772 浏览 9 评论 0原文

这是一个看似简单的问题：给定一个按升序生成整数序列的迭代器列表，编写一个简洁的生成器，仅生成每个序列中出现的整数。

昨晚读了几篇论文后，我决定用 Python 编写一个完全最小的全文索引器，如此处所示（尽管该版本现在已经很旧了）。

我的问题是 search() 函数，它必须迭代每个发布列表并仅生成每个列表中出现的文档 ID。正如您从上面的链接中看到的，我当前的非递归“工作”尝试非常糟糕。

示例：

postings = [[1,   100, 142, 322, 12312],
            [2,   100, 101, 322, 1221],
            [100, 142, 322, 956, 1222]]

应该产生：

[100, 322]

至少有一个优雅的递归函数解决方案，但如果可能的话我想避免这种情况。然而，涉及嵌套生成器表达式、itertools 滥用或任何其他类型的代码高尔夫的解决方案是非常受欢迎的。 :-)

应该可以安排函数只需要与最小列表中的项目一样多的步骤，而不会将整个整数集吸入内存。将来，这些列表可能会从磁盘读取，并且会大于可用 RAM。

在过去的 30 分钟里，我的脑海里浮现出一个想法，但我无法将其转化为代码。请记住，这只是为了好玩！

原文

Here is a seemingly simple problem: given a list of iterators that yield sequences of integers in ascending order, write a concise generator that yields only the integers that appear in every sequence.

After reading a few papers last night, I decided to hack up a completely minimal full text indexer in Python, as seen here (though that version is quite old now).

My problem is with the search() function, which must iterate over each posting list and yield only the document IDs that appear on every list. As you can see from the link above, my current non-recursive 'working' attempt is terrible.

Example:

postings = [[1,   100, 142, 322, 12312],
            [2,   100, 101, 322, 1221],
            [100, 142, 322, 956, 1222]]

Should yield:

[100, 322]

There is at least one elegant recursive function solution to this, but I'd like to avoid that if possible. However, a solution involving nested generator expressions, itertools abuse, or any other kind of code golf is more than welcome. :-)

It should be possible to arrange for the function to only require as many steps as there are items in the smallest list, and without sucking the entire set of integers into memory. In future, these lists may be read from disk, and larger than available RAM.

For the past 30 minutes I've had an idea on the tip of my tongue, but I can't quite get it into code. Remember, this is just for fun!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

_失温 2024-07-29 21:25:57

import heapq, itertools
def intersect(*its):
    for key, values in itertools.groupby(heapq.merge(*its)):
        if len(list(values)) == len(its):
            yield key

>>> list(intersect(*postings))
[100, 322]

import heapq, itertools
def intersect(*its):
    for key, values in itertools.groupby(heapq.merge(*its)):
        if len(list(values)) == len(its):
            yield key

>>> list(intersect(*postings))
[100, 322]

回复收藏 0 原文

身边 2024-07-29 21:25:57

def postings(posts):
    sets = (set(l) for l in posts)
    return sorted(reduce(set.intersection, sets))

...你可以尝试利用列表是有序的这一事实，但由于reduce、生成器表达式和set都是用C实现的，因此你可能很难用python实现的逻辑做得比上面更好。

def postings(posts):
    sets = (set(l) for l in posts)
    return sorted(reduce(set.intersection, sets))

... you could try and take advantage of the fact that the lists are ordered, but since reduce, generator expressions and set are all implemented in C, you'll probably have a hard time doing better than the above with logic implemented in python.

回复收藏 0 原文

一江春梦 2024-07-29 21:25:57

该解决方案将计算迭代器的交集。它的工作原理是一次推进迭代器一步，并在所有迭代器中寻找相同的值。当找到时，就会产生这样的值——这使得 intersect 函数本身成为生成器。

import operator

def intersect(sequences):
    """Compute intersection of sequences of increasing integers.

    >>> list(intersect([[1,   100, 142, 322, 12312],
    ...                 [2,   100, 101, 322, 1221],
    ...                 [100, 142, 322, 956, 1222]]))
    [100, 322]
    """
    iterators = [iter(seq) for seq in sequences]
    last = [iterator.next() for iterator in iterators]
    indices = range(len(iterators) - 1)
    while True:
        # The while loop stops when StopIteration is raised. The
        # exception will also stop the iteration by our caller.
        if reduce(operator.and_, [l == last[0] for l in last]):
            # All iterators contain last[0]
            yield last[0]
            last = [iterator.next() for iterator in iterators]

        # Now go over the iterators once and advance them as
        # necessary. To stop as soon as the smallest iterator is
        # exhausted we advance each iterator only once per iteration
        # in the while loop.
        for i in indices:
            if last[i] < last[i+1]:
                last[i] = iterators[i].next()
            if last[i] > last[i+1]:
                last[i+1] = iterators[i+1].next()

This solution will compute the intersection of your iterators. It works by advancing the iterators one step at a time and looking for the same value in all of them. When found, such values are yielded -- this makes the intersect function a generator itself.

import operator

def intersect(sequences):
    """Compute intersection of sequences of increasing integers.

    >>> list(intersect([[1,   100, 142, 322, 12312],
    ...                 [2,   100, 101, 322, 1221],
    ...                 [100, 142, 322, 956, 1222]]))
    [100, 322]
    """
    iterators = [iter(seq) for seq in sequences]
    last = [iterator.next() for iterator in iterators]
    indices = range(len(iterators) - 1)
    while True:
        # The while loop stops when StopIteration is raised. The
        # exception will also stop the iteration by our caller.
        if reduce(operator.and_, [l == last[0] for l in last]):
            # All iterators contain last[0]
            yield last[0]
            last = [iterator.next() for iterator in iterators]

        # Now go over the iterators once and advance them as
        # necessary. To stop as soon as the smallest iterator is
        # exhausted we advance each iterator only once per iteration
        # in the while loop.
        for i in indices:
            if last[i] < last[i+1]:
                last[i] = iterators[i].next()
            if last[i] > last[i+1]:
                last[i+1] = iterators[i+1].next()

回复收藏 0 原文

终止放荡 2024-07-29 21:25:57

如果这些序列确实很长（甚至无限），并且您不想提前将所有内容加载到集合中，则可以在每个迭代器上使用 1 项前瞻来实现此目的。

EndOfIter = object() # Sentinel value

class PeekableIterator(object):
    def __init__(self, it):
        self.it = it
        self._peek = None
        self.next() # pump iterator to get first value

    def __iter__(self): return self

    def next(self):
        cur = self._peek
        if cur is EndOfIter:
            raise StopIteration()

        try:
            self._peek = self.it.next()
        except StopIteration:
            self._peek = EndOfIter
        return cur

    def peek(self): 
        return self._peek


def contained_in_all(seqs):
   if not seqs: return   # No items
   iterators = [PeekableIterator(iter(seq)) for seq in seqs]
   first, rest = iterators[0], iterators[1:]

   for item in first:
       candidates = list(rest)
       while candidates:
           if any(c.peek() is EndOfIter for c in candidates): return  # Exhausted an iterator
           candidates = [c for c in candidates if c.peek() < item]
           for c in candidates: c.next()

       # Out of loop if first item in remaining iterator are all >= item.
       if all(it.peek() == item for it in rest):
           yield item

用法：

>>> print list(contained_in_all(postings))
[100, 322]

If these are really long (or even infinite) sequences, and you don't want to load everything into a set in advance, you can implement this with a 1-item lookahead on each iterator.

EndOfIter = object() # Sentinel value

class PeekableIterator(object):
    def __init__(self, it):
        self.it = it
        self._peek = None
        self.next() # pump iterator to get first value

    def __iter__(self): return self

    def next(self):
        cur = self._peek
        if cur is EndOfIter:
            raise StopIteration()

        try:
            self._peek = self.it.next()
        except StopIteration:
            self._peek = EndOfIter
        return cur

    def peek(self): 
        return self._peek


def contained_in_all(seqs):
   if not seqs: return   # No items
   iterators = [PeekableIterator(iter(seq)) for seq in seqs]
   first, rest = iterators[0], iterators[1:]

   for item in first:
       candidates = list(rest)
       while candidates:
           if any(c.peek() is EndOfIter for c in candidates): return  # Exhausted an iterator
           candidates = [c for c in candidates if c.peek() < item]
           for c in candidates: c.next()

       # Out of loop if first item in remaining iterator are all >= item.
       if all(it.peek() == item for it in rest):
           yield item

Usage:

>>> print list(contained_in_all(postings))
[100, 322]

回复收藏 0 原文

狼亦尘 2024-07-29 21:25:57

这个怎么样：

import heapq

def inalliters(iterators):
  heap=[(iterator.next(),iterator) for iterator in iterators]
  heapq.heapify(heap)
  maximal = max(heap)[0]
  while True:
    value,iterator = heapq.heappop(heap)
    if maximal==value: yield value
    nextvalue=iterator.next()
    heapq.heappush(heap,(nextvalue,iterator))
    maximal=max(maximal,nextvalue)

postings = [iter([1,   100, 142, 322, 12312]),
            iter([2,   100, 101, 322, 1221]),
            iter([100, 142, 322, 956, 1222])]
print [x for x in inalliters(postings)]

我还没有非常彻底地测试它（只是运行了你的例子），但我相信基本的想法是合理的。

What about this:

import heapq

def inalliters(iterators):
  heap=[(iterator.next(),iterator) for iterator in iterators]
  heapq.heapify(heap)
  maximal = max(heap)[0]
  while True:
    value,iterator = heapq.heappop(heap)
    if maximal==value: yield value
    nextvalue=iterator.next()
    heapq.heappush(heap,(nextvalue,iterator))
    maximal=max(maximal,nextvalue)

postings = [iter([1,   100, 142, 322, 12312]),
            iter([2,   100, 101, 322, 1221]),
            iter([100, 142, 322, 956, 1222])]
print [x for x in inalliters(postings)]

I haven't tested it very thoroughly (just ran your example), but I believe the basic idea is sound.

回复收藏 0 原文

情定在深秋 2024-07-29 21:25:57

我想证明有一个优雅的解决方案，它仅向前迭代一次。抱歉，我不太了解 Python，所以我使用虚构的类。它读取input（一个迭代器数组），并即时写入output，而无需返回或使用任何数组函数！

    def intersect (input, output) 
        do:
            min = input[0]
            bingo = True
            for i in input:
                if (i.cur < min.cur):
                     bingo = False
                     min =  i
            if bingo: 
                output.push(min.cur)
        while (min.step())

I want to show that there's an elegant solution, which only iterates forward once. Sorry, I don't know the Python well enough, so I use fictional classes. This one reads input, an array of iterators, and writes to output on-the-fly without ever going back or using any array function!.

    def intersect (input, output) 
        do:
            min = input[0]
            bingo = True
            for i in input:
                if (i.cur < min.cur):
                     bingo = False
                     min =  i
            if bingo: 
                output.push(min.cur)
        while (min.step())

回复收藏 0 原文

隐诗 2024-07-29 21:25:57

这个运行时间为 O(n*m)，其中 n 是所有迭代器长度的总和，m 是列表的数量。通过使用第 6 行中的堆，可以将其变为 O(n*logm)。

def intersection(its):
  if not its: return
  vs = [next(it) for it in its]
  m = max(vs)
  while True:
    v, i = min((v,i) for i,v in enumerate(vs))
    if v == m:
      yield m
    vs[i] = next(its[i])
    m = max(m, vs[i])

This one runs in O(n*m) where n is the sum of all iterator lengths, and m is the number of lists. It can be made O(n*logm) by using a heap in line 6.

def intersection(its):
  if not its: return
  vs = [next(it) for it in its]
  m = max(vs)
  while True:
    v, i = min((v,i) for i,v in enumerate(vs))
    if v == m:
      yield m
    vs[i] = next(its[i])
    m = max(m, vs[i])

回复收藏 0 原文

~没有更多了~