Suppose I want to achieve a splitting of a Python iterable, without listifying each chunk, similar to itertools.groupby, whose chunks are lazy. But I want to do it on a more sophisticated condition than equality of a key. So more like a parser.

For example, suppose I want to use odd numbers as delimiters in an iterable of integers, like more_itertools.split_at(xs, lambda x: x % 2 == 1). (But more_itertools.split_at listifies each chunk.)
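To make the difference concrete, the eager version behaves like this (assuming more_itertools is installed); what I want is the same splitting, but with each chunk delivered as a lazy iterator rather than a list:

import more_itertools

xs = [2, 4, 1, 6, 8, 3, 5, 10]
# Odd numbers act as delimiters and are dropped; every chunk comes back as a fully-built list.
print(list(more_itertools.split_at(xs, lambda x: x % 2 == 1)))
# [[2, 4], [6, 8], [], [10]]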
In parser combinator language this might be called sepBy1(odd, many(even)). In Haskell there are the Parsec, pipes-parse and pipes-group libraries which address this kind of problem. For instance, it would be sufficient and interesting to write an itertools.groupby-like version of groupsBy' from Pipes.Group (see hackage.haskell.org/package/pipes-group-1.0.12/docs/Pipes-Group.html).
There could probably be some clever jiu jitsu with itertools.groupby, perhaps applying itertools.pairwise, then itertools.groupby, and then going back to single elements.
I could write it myself as a generator, I suppose, but writing itertools.groupby in Python (below) is already pretty involved. Also not readily generalizable.
Seems like there should be something for this more generally, like a relatively painless way of writing parsers and combinators for streams of whatever type.
# From https://docs.python.org/3/library/itertools.html#itertools.groupby
# groupby() is roughly equivalent to:
class groupby:
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D

    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()

    def __iter__(self):
        return self

    def __next__(self):
        self.id = object()
        while self.currkey == self.tgtkey:
            self.currvalue = next(self.it)    # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey, self.id))

    def _grouper(self, tgtkey, id):
        while self.id is id and self.currkey == tgtkey:
            yield self.currvalue
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                return
            self.currkey = self.keyfunc(self.currvalue)
Comments (1)
Here are a couple of simple iterator splitters, which I wrote in a fit of boredom. I don't think they're particularly profound, but perhaps they'll help in some way.
I didn't spend a lot of time thinking about useful interfaces, optimisations, or implementing multiple interacting sub-features. All of that stuff could be added, if desired.
These are basically modelled on itertools.groupby, whose interface could be considered a bit weird. It's the consequence of Python really not being a functional programming language. Python's generators (and other objects which implement the iterator protocol) are stateful, and there is no facility for saving and restoring generator state. So the functions do return an iterator which successively generates iterators, which produce values from the original iterator. But the returned iterators share the underlying iterable, which is the iterable passed in to the original call, which means that when you advance the outer iterator, any unconsumed values in the current inner iterator are discarded without notice.

There are (fairly expensive) ways to avoid discarding the values, but since the most obvious one --listifying-- was ruled out from the start, I just went with the groupby interface despite the awkwardness of accurately documenting the behaviour. It would be possible to wrap the inner iterators with itertools.tee in order to make the original iterators independent, but that comes at a price similar to (or possibly slightly greater than) listifying. It still requires each sub-iterator to be fully generated before the next sub-iterator is started, but it doesn't require the sub-iterator to be fully generated before you start using values.

For simplicity (according to me :-) ), I implemented these functions as generators rather than objects, as with itertools and more_itertools. The outer generator yields each successive subiterator and then collects and discards any remaining values from it before yielding the next subiterator [Note 1]. I imagine that most of the time the subiterator will be fully exhausted before the outer loop tries to flush it, so the additional call will be a bit wasteful, but it's simpler than the code you cite for itertools.groupby.
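The discard-without-notice behaviour being emulated is exactly what the stdlib groupby does; for instance, with nothing but the standard library:

from itertools import groupby

groups = groupby('AAAABBBCCD')
key1, group1 = next(groups)
print(next(group1))          # 'A'  (start consuming the first group lazily)
key2, group2 = next(groups)  # advancing the outer iterator...
print(list(group1))          # []   (...silently discards the rest of group1)
print(key2, list(group2))    # B ['B', 'B', 'B']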
It's still necessary to communicate back from the subiterator the fact that the original iterator was exhausted, since that's not something you can ask an iterator about. I use a nonlocal declaration to share state between the outer and the inner generators. In some ways, maintaining state in an object, as itertools.groupby does, might be more flexible and maybe even be considered more Pythonic, but nonlocal worked for me.

I implemented more_itertools.split_at (without the maxsplit and keep_separator options) and what I think is the equivalent of Pipes.Group.groupsBy', renamed as split_between to indicate that it splits between two consecutive elements if they satisfy some condition.

Note that split_between always forces the first value from the supplied iterator before it has been requested by running the first subiterator. The rest of the values are generated lazily. I tried a few ways to defer the first object, but in the end I went with this design because it's a lot simpler. The consequence is that split_at, which doesn't do the initial force, always returns at least one subiterator, even if the supplied argument is empty, whereas split_between does not. I'd have to try both of these for some real problem in order to decide which interface I prefer; if you have a preference, by all means express it (but no guarantees about changes).
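A minimal sketch of what such generator-based splitters can look like, assuming a one-argument predicate for split_at and a two-argument predicate for split_between (the actual implementations may differ in detail):

from collections import deque

def split_at(iterable, pred):
    # Split `iterable` into lazy sub-iterators, dropping each element
    # for which pred(element) is true and splitting at that point.
    it = iter(iterable)
    exhausted = False              # set by the inner generator when `it` runs dry

    def chunk():
        nonlocal exhausted
        for value in it:
            if pred(value):
                return             # delimiter: end this chunk (delimiter is dropped)
            yield value
        exhausted = True           # underlying iterator is finished

    while not exhausted:
        group = chunk()
        yield group
        deque(group, maxlen=0)     # flush whatever the caller didn't consume [Note 1]

def split_between(iterable, pred):
    # Split `iterable` into lazy sub-iterators, splitting between two
    # consecutive elements x, y whenever pred(x, y) is true.
    it = iter(iterable)
    try:
        lookahead = next(it)       # the initial force described above
    except StopIteration:
        return                     # empty input: no sub-iterators at all
    done = False

    def chunk():
        nonlocal lookahead, done
        while True:
            current = lookahead
            yield current
            try:
                lookahead = next(it)
            except StopIteration:
                done = True
                return
            if pred(current, lookahead):
                return             # split point: the next chunk starts at `lookahead`

    while not done:
        group = chunk()
        yield group
        deque(group, maxlen=0)     # flush whatever the caller didn't consume [Note 1]

# Used eagerly (listifying each chunk just for display):
print([list(g) for g in split_at([2, 4, 1, 6], lambda x: x % 2 == 1)])
# [[2, 4], [6]]
print([list(g) for g in split_between([2, 4, 1, 3, 6], lambda x, y: x % 2 != y % 2)])
# [[2, 4], [1, 3], [6]]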
Notes

1. collections.deque(g, maxlen=0) is, I believe, currently the most efficient way of discarding the remaining values of an iterator, although it looks a bit mysterious. Credits to more_itertools for pointing me at that solution, and the related expression to count the number of objects produced by a generator:
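For illustration, a walrus-based expression in that spirit (a sketch; not necessarily the exact expression being referred to) could be:

import collections

g = (x for x in range(10))    # any generator
# Drain g while numbering its items; the single retained (index, value) pair gives the count.
n = last[0][0] if (last := collections.deque(enumerate(g, 1), maxlen=1)) else 0
print(n)                      # 10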
Although I don't mean to blame more_itertools for the above monstrosity. (They do it with an if statement, not a walrus.)