Parsing an iterable without listifying each chunk

Posted 2025-01-25 21:41:48

Suppose I want to achieve a splitting of a Python iterable, without listifying each chunk, similar to itertools.groupby, whose chunks are lazy. But I want to do it on a more sophisticated condition than equality of a key. So more like a parser.

For example, suppose I want to use odd numbers as delimiters in an iterable of integers. Like more_itertools.split_at(lambda x: x % 2 == 1, xs). (But more_itertools.split_at listifies each chunk.)
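
For concreteness, a quick demo (my addition) of that listifying behaviour:

import more_itertools

xs = [2, 4, 1, 6, 8, 3, 2]
print(list(more_itertools.split_at(xs, lambda x: x % 2 == 1)))
# -> [[2, 4], [6, 8], [2]]  (each chunk is a fully-built list)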

In parser combinator language this might be called sepBy1(odd, many(even)). In Haskell there are the Parsec, pipes-parse and pipes-group libraries which address this kind of problem. For instance, it would be sufficient and interesting to write an itertools.groupby-like version of groupsBy' from Pipes.Group (https://hackage.haskell.org/package/pipes-group-1.0.12/docs/Pipes-Group.html).

There could probably be some clever jiu jitsu with itertools.groupby, perhaps applying itertools.pairwise, then itertools.groupby, and then going back to single elements.
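
Here is one such attempt, a sketch of mine rather than a vetted implementation (it uses a stateful key function instead of pairwise): bump a counter at each delimiter, group on the counter with itertools.groupby, and filter the delimiter out of each group. The groups stay lazy because groupby's groups are lazy.

import itertools

def split_at_lazy(iterable, pred):
    # Caveats: pred is called twice per element, and a delimiter at the
    # very start loses the leading empty chunk that
    # more_itertools.split_at would emit.
    counter = 0
    def key(x):
        nonlocal counter
        if pred(x):
            counter += 1
        return counter
    for _, group in itertools.groupby(iterable, key):
        yield (x for x in group if not pred(x))

# [list(g) for g in split_at_lazy([2, 4, 1, 6, 8, 3, 3, 2], lambda x: x % 2)]
# -> [[2, 4], [6, 8], [], [2]]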

I could write it myself as a generator, I suppose, but writing itertools.groupby in Python (below) is already pretty involved. Also not readily generalizable.

Seems like there should be something for this more generally, like a relatively painless way of writing parsers and combinators for streams of whatever type.

# From https://docs.python.org/3/library/itertools.html#itertools.groupby
# groupby() is roughly equivalent to:
class groupby:
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()
    def __iter__(self):
        return self
    def __next__(self):
        self.id = object()
        while self.currkey == self.tgtkey:
            self.currvalue = next(self.it)    # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey, self.id))
    def _grouper(self, tgtkey, id):
        while self.id is id and self.currkey == tgtkey:
            yield self.currvalue
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                return
            self.currkey = self.keyfunc(self.currvalue)

Answer by 染火枫林 (2025-02-01 21:41:48)

Here are a couple of simple iterator splitters, which I wrote in a fit of boredom. I don't think they're particularly profound, but perhaps they'll help in some way.

I didn't spend a lot of time thinking about useful interfaces, optimisations, or implementing multiple interacting sub-features. All of that stuff could be added, if desired.

These are basically modelled on itertools.groupby, whose interface could be considered a bit weird. It's the consequence of Python really not being a functional programming language. Python's generators (and other objects which implement the iterator protocol) are stateful and there is no facility for saving and restoring generator state. So the functions do return an iterator which successively generates iterators, which produce values from the original iterator. But the returned iterators share the underlying iterable, which is the iterable passed in to the original call, which means that when you advance the outer iterator, any unconsumed values in the current inner iterator are discarded without notice.
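
A quick demonstration of that discard behaviour (my addition, using the real itertools.groupby):

import itertools

groups = itertools.groupby("AAABBB")
k1, g1 = next(groups)
print(k1, next(g1))    # A A
k2, g2 = next(groups)  # advancing the outer iterator invalidates g1
print(k2, list(g1))    # B []  (the unconsumed A's were silently dropped)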

There are (fairly expensive) ways to avoid discarding the values, but since the most obvious one (listifying) was ruled out from the start, I just went with the groupby interface despite the awkwardness of accurately documenting the behaviour. It would be possible to wrap the inner iterators with itertools.tee in order to make the original iterators independent, but that comes at a price similar to (or possibly slightly greater than) listifying. It still requires each sub-iterator to be fully generated before the next sub-iterator is started, but it doesn't require the sub-iterator to be fully generated before you start using values.

For simplicity (according to me :-) ), I implemented these functions as generators rather than objects, as with itertools and more_itertools. The outer generator yields each successive subiterator and then collects and discards any remaining values from it before yielding the next subiterator [Note 1]. I imagine that most of the time the subiterator will be fully exhausted before the outer loop tries to flush it, so the additional call will be a bit wasteful, but it's simpler than the code you cite for itertools.groupby.

It's still necessary to communicate back from the subiterator the fact that the original iterator was exhausted, since that's not something you can ask an iterator about. I use a nonlocal declaration to share state between the outer and the inner generators. In some ways, maintaining state in an object, as itertools.groupby does, might be more flexible and maybe even be considered more Pythonic, but nonlocal worked for me.

I implemented more_itertools.split_at (without the maxsplit and keep_separator options) and what I think is the equivalent of groupsBy' from Pipes.Group, renamed split_between to indicate that it splits between two consecutive elements when they satisfy some condition.

Note that split_between always forces the first value from the supplied iterator, before the first subiterator has even been requested. The rest of the values are generated lazily. I tried a few ways to defer that first read, but in the end I went with this design because it's a lot simpler. The consequence is that split_at, which doesn't do the initial force, always returns at least one subiterator, even if the supplied argument is empty, whereas split_between does not. I'd have to try both of these on some real problem in order to decide which interface I prefer; if you have a preference, by all means express it (but no guarantees about changes).

from collections import deque

def split_at(iterable, pred=lambda x: x is None):
    '''Produces an iterator which returns successive sub-iterations of 
       `iterable`, delimited by values for which `pred` returns
       truthiness. The default predicate returns True only for the
       value None.

       The sub-iterations share the underlying iterable, so they are not 
       independent of each other. Advancing the outer iterator will discard
       the rest of the current sub-iteration.

       The delimiting values are discarded.
    '''

    done = False
    iterable = iter(iterable)

    def subiter():
        nonlocal done
        for value in iterable:
            if pred(value): return
            yield value
        done = True

    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)

def split_between(iterable, pred=lambda before, after: before + 1 != after):
    '''Produces an iterator which returns successive sub-iterations of 
       `iterable`, delimited at points where calling `pred` on two
       consecutive values produces truthiness. The default predicate
       returns True when the two values are not consecutive, making it
       possible to split a sequence of integers into contiguous ranges.

       The sub-iterations share the underlying iterable, so they are not 
       independent of each other. Advancing the outer iterator will discard
       the rest of the current sub-iteration.
    '''
    iterable = iter(iterable)

    try:
        before = next(iterable)
    except StopIteration:
        return

    done = False

    def subiter():
        nonlocal done, before
        for after in iterable:
            yield before
            prev, before = before, after
            if pred(prev, before):
                return

        yield before
        done = True

    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)
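
A quick usage check (my addition, not part of the original answer); each chunk is listified here only for display:

print([list(g) for g in split_at([2, 4, 1, 6, 8, 3, 2], lambda x: x % 2 == 1)])
# -> [[2, 4], [6, 8], [2]]

print([list(g) for g in split_between([1, 2, 3, 7, 8, 10, 11])])
# -> [[1, 2, 3], [7, 8], [10, 11]]

# The interface difference noted above: split_at always yields at least
# one chunk, split_between does not.
print([list(g) for g in split_at([])])       # -> [[]]
print([list(g) for g in split_between([])])  # -> []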

Notes

  1. collections.deque(g, maxlen=0) is, I believe, currently the most efficient way of discarding the remaining values of an iterator, although it looks a bit mysterious. Credits to more_itertools for pointing me at that solution, and the related expression to count the number of objects produced by a generator:
    cache[0][0] if (cache := deque(enumerate(it, 1), maxlen=1)) else 0
    

    Although I don't mean to blame more_itertools for the above monstrosity. (They do it with an if statement, not a walrus.)
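
    To make that footnote concrete, a small self-check (my addition; more_itertools.ilen is the packaged equivalent):

    from collections import deque

    def ilen(it):
        # The condition of the conditional expression is evaluated first,
        # so `cache` is bound by the walrus before it is read.
        return cache[0][0] if (cache := deque(enumerate(it, 1), maxlen=1)) else 0

    print(ilen(iter("abcd")))  # -> 4
    print(ilen(iter("")))      # -> 0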
