How to iterate over a list in chunks
I have a Python script which takes as input a list of integers, which I need to work with four integers at a time. Unfortunately, I don't have control of the input, or I'd have it passed in as a list of four-element tuples. Currently, I'm iterating over it this way:
for i in range(0, len(ints), 4):
    # dummy op for example code
    foo += ints[i] * ints[i + 1] + ints[i + 2] * ints[i + 3]
It looks a lot like "C-think", though, which makes me suspect there's a more pythonic way of dealing with this situation. The list is discarded after iterating, so it needn't be preserved. Perhaps something like this would be better?
while ints:
    foo += ints[0] * ints[1] + ints[2] * ints[3]
    ints[0:4] = []
Still doesn't quite "feel" right, though. :-/
Update: With the release of Python 3.12, I've changed the accepted answer. For anyone who has not yet made (or cannot make) the jump to Python 3.12, I encourage you to check out the previous accepted answer or any of the other excellent, backwards-compatible answers below.
Related question: How do you split a list into evenly sized chunks in Python?
Since Python 3.8 you can use the walrus operator := and itertools.islice.
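The snippet itself is missing here; a minimal sketch of the walrus-plus-islice idiom (the sample data is illustrative):

from itertools import islice

ints = list(range(1, 11))
it = iter(ints)
while chunk := list(islice(it, 4)):   # Python 3.8+
    print(chunk)

# [1, 2, 3, 4]
# [5, 6, 7, 8]
# [9, 10]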
Another way:
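The alternative snippet is also missing; one equivalent shape (assumed, not necessarily the author's exact code) uses two-argument iter with an empty-list sentinel instead of the walrus:

from itertools import islice

ints = list(range(1, 11))
it = iter(ints)
for chunk in iter(lambda: list(islice(it, 4)), []):
    print(chunk)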
If you don't mind using an external package, you could use iteration_utilities.grouper from iteration_utilities¹. It supports all iterables (not just sequences):
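A sketch of typical usage (assuming grouper(iterable, n) yields tuples; the data is illustrative):

from iteration_utilities import grouper

for group in grouper(range(1, 11), 4):
    print(group)

which prints:

(1, 2, 3, 4)
(5, 6, 7, 8)
(9, 10)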
In case the length isn't a multiple of the group size, it also supports filling (the incomplete last group) or truncating (discarding the incomplete last group) the last one:
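Again a sketch; fillvalue and truncate are my assumption about the parameter names:

from iteration_utilities import grouper

for group in grouper(range(1, 11), 4, fillvalue=0):
    print(group)               # the last group becomes (9, 10, 0, 0)

for group in grouper(range(1, 11), 4, truncate=True):
    print(group)               # the incomplete (9, 10) is discarded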
Benchmarks
I also decided to compare the run-times of a few of the mentioned approaches. It's a log-log plot, grouping a list of varying size into groups of 10 elements. For qualitative results: lower means faster:
At least in this benchmark, iteration_utilities.grouper performs best, followed by the approach of Craz. The benchmark was created with simple_benchmark¹. The code used to run this benchmark was:
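The benchmark script itself is not preserved; a condensed sketch of how such a simple_benchmark run is typically set up (the competing functions and sizes here are placeholders, not the author's exact code):

from itertools import zip_longest
from iteration_utilities import grouper
from simple_benchmark import benchmark

def iteration_utilities_grouper(iterable):
    return list(grouper(iterable, 10))

def craz_zip_longest(iterable):
    return list(zip_longest(*[iter(iterable)] * 10))

arguments = {2**i: list(range(2**i)) for i in range(2, 18)}
b = benchmark([iteration_utilities_grouper, craz_zip_longest],
              arguments, argument_name='list size')
b.plot()   # log-log plot; needs matplotlib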
¹ Disclaimer: I'm the author of the libraries iteration_utilities and simple_benchmark.
As of Python 3.12, the itertools module gains a batched function that specifically covers iterating over batches of an input iterable, where the final batch may be incomplete (each batch is a tuple). Per the example code given in the docs:
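The docs example is not preserved here; a minimal equivalent sketch:

from itertools import batched   # Python 3.12+

for batch in batched([1, 2, 3, 4, 5, 6, 7, 8, 9], 4):
    print(batch)

# (1, 2, 3, 4)
# (5, 6, 7, 8)
# (9,)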
Performance notes:

The implementation of batched, like all itertools functions to date, is at the C layer, so it's capable of optimizations Python-level code cannot match, e.g.:

- Each batch is allocated as a tuple of precisely the correct size (for all but the last batch), instead of building the tuple up element by element with amortized growth causing multiple reallocations (the way a solution calling tuple on an islice does)
- It only needs to look up the .__next__ method of the underlying iterator once per batch, not n times per batch (the way a zip_longest((iter(iterable),) * n)-based approach does)
- Checking for the end case is a simple C-level NULL check (trivial, and required to handle possible exceptions anyway)
- Handling the end case is a simple goto followed by a direct realloc (no making a copy into a smaller tuple) down to the already-known final size, since it's tracking how many elements it has successfully pulled (none of the complex "create a sentinel for use as fillvalue and perform Python-level if/else checks for each batch to see if it's empty, with the final batch requiring a search for where the fillvalue appeared last, to create the cut-down tuple" that zip_longest-based solutions require)

Between all these advantages, it should massively outperform any Python-level solution (even highly optimized ones that push most or all of the per-item work to the C layer), regardless of whether the input iterable is long or short, and regardless of the batch size and the size of the final (possibly incomplete) batch. zip_longest-based solutions using a guaranteed-unique fillvalue for safety are the best possible solution for almost all cases when itertools.batched is not available, but they can suffer in pathological cases of "few large batches, with the final batch mostly, but not completely, filled", especially pre-3.10, when bisect can't be used to optimize slicing off the fillvalues from an O(n) linear search down to an O(log n) binary search. batched avoids that search entirely, so it won't experience the pathological cases at all.
I needed a solution that would also work with sets and generators. I couldn't come up with anything very short and pretty, but it's quite readable at least. It runs the same way on a list, a set, and a generator:
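The original snippets are missing; a readable sketch in the same spirit (the chunker body is my assumption):

def chunker(iterable, n):
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == n:
            yield chunk
            chunk = []
    if chunk:                                     # final, possibly short, chunk
        yield chunk

print(list(chunker([1, 2, 3, 4, 5], 2)))          # list
print(list(chunker({1, 2, 3, 4, 5}, 2)))          # set
print(list(chunker((i for i in range(5)), 2)))    # generator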
The more-itertools package has a chunked method which does exactly that:
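A sketch of chunked usage (data is illustrative):

import more_itertools

for chunk in more_itertools.chunked([1, 2, 3, 4, 5, 6, 7, 8, 9], 4):
    print(chunk)

which prints:

[1, 2, 3, 4]
[5, 6, 7, 8]
[9]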
chunked returns the items in a list. If you'd prefer an iterable, use ichunked.
The ideal solution for this problem works with iterators (not just sequences). It should also be fast.
This is the solution provided by the documentation for itertools:
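The snippet is missing; for reference, this is the grouper recipe as the itertools docs gave it:

from itertools import zip_longest   # izip_longest on Python 2

def grouper(iterable, n, fillvalue=None):
    # Collect data into fixed-length chunks or blocks, padding the last:
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)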
Using IPython's %timeit on my MacBook Air, I get 47.5 µs per loop.

However, this really doesn't work for me, since the results are padded into even-sized groups. A solution without the padding is slightly more complicated. The most naive solution might be:
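The naive version is missing too; one plausible shape, pulling n items at a time by hand:

def grouper_naive(iterable, n):
    it = iter(iterable)
    while True:
        out = []
        for _ in range(n):
            try:
                out.append(next(it))
            except StopIteration:
                if out:
                    yield tuple(out)   # final short group, no padding
                return
        yield tuple(out)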
Simple, but pretty slow: 693 µs per loop.
The best solution I could come up with uses islice for the inner loop:
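A sketch of that islice-based version (the function name is mine):

from itertools import islice

def grouper_islice(iterable, n):
    it = iter(iterable)
    while True:
        chunk = tuple(islice(it, n))
        if not chunk:
            return
        yield chunk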
With the same dataset, I get 305 µs per loop.
Unable to get a pure solution any faster than that, I provide the following solution with an important caveat: if your input data has instances of filldata in it, you could get the wrong answer. I really don't like this answer, but it is significantly faster: 124 µs per loop.
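A sketch consistent with that description: pad via zip_longest with filldata, then strip the padding back off (this is where the caveat bites, since real occurrences of filldata are stripped too):

from itertools import zip_longest

def grouper_fast(iterable, n, filldata=None):
    args = [iter(iterable)] * n
    for chunk in zip_longest(*args, fillvalue=filldata):
        if chunk[-1] is filldata:
            chunk = tuple(v for v in chunk if v is not filldata)
        yield chunk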
Since nobody's mentioned it yet, here's a zip() solution:

It works only if your sequence's length is always divisible by the chunk size, or if you don't care about a trailing chunk when it isn't.
Example:
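A sketch of the zip(*[iter(seq)] * n) idiom this describes:

s = '1234567890'
print(list(zip(*[iter(s)] * 4)))

# [('1', '2', '3', '4'), ('5', '6', '7', '8')]   -- the trailing '90' is dropped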
Or using itertools.izip to return an iterator instead of a list:
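A sketch (Python 2 only; on Python 3, zip is already lazy and izip is gone):

from itertools import izip   # Python 2

s = '1234567890'
chunks = izip(*[iter(s)] * 4)   # an iterator of 4-tuples, not a list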
Padding can be fixed using @ΤΖΩΤΖΙΟΥ's answer:
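That fix presumably amounts to zip_longest (izip_longest on Python 2) with a fillvalue; a sketch:

from itertools import zip_longest   # izip_longest on Python 2

s = '1234567890'
print(list(zip_longest(*[iter(s)] * 4, fillvalue='x')))
# the last chunk becomes ('9', '0', 'x', 'x') instead of being dropped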
Similar to other proposals, but not exactly identical, I like doing it this way, because it's simple and easy to read:
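The snippet is missing; one simple, readable possibility consistent with the follow-up text (my reconstruction, not necessarily the author's exact code):

ints = range(1, 10)
it = iter(ints)
for a, b, c, d in zip(it, it, it, it):
    print(a, b, c, d)

# 1 2 3 4
# 5 6 7 8    (the lone 9 is silently dropped)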
This way you won't get the last partial chunk. If you want to get (9, None, None, None) as the last chunk, just use izip_longest from itertools.
Another approach would be to use the two-argument form of iter:
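A sketch of the idiom: iter(callable, sentinel) keeps calling the callable until it returns the sentinel, here the empty tuple:

from itertools import islice

def grouper(n, iterable):
    it = iter(iterable)
    return iter(lambda: tuple(islice(it, n)), ())

print(list(grouper(4, range(1, 11))))
# [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10)]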
This can be adapted easily to use padding (this is similar to Markus Jarderot's answer):
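A sketch of the padded variant (parameter names are my assumption): chain an endless supply of fillvalue after the input, and use an all-fillvalue group as the sentinel:

from itertools import chain, islice, repeat

def grouper_padded(n, iterable, fillvalue=None):
    it = chain(iter(iterable), repeat(fillvalue))
    return iter(lambda: tuple(islice(it, n)), (fillvalue,) * n)

print(list(grouper_padded(4, range(1, 11))))
# [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, None, None)]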
These can even be combined for optional padding:
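And a combined sketch with an optional pad switch (again, an assumption about the original's shape):

from itertools import chain, islice, repeat

def grouper_flexible(n, iterable, fillvalue=None, pad=False):
    it = iter(iterable)
    if pad:
        it = chain(it, repeat(fillvalue))
        sentinel = (fillvalue,) * n
    else:
        sentinel = ()
    return iter(lambda: tuple(islice(it, n)), sentinel)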
If the list is large, the highest-performing way to do this will be to use a generator:
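A minimal generator sketch:

def chunker(seq, size):
    # lazily yield successive slices; the last chunk may be short
    for pos in range(0, len(seq), size):
        yield seq[pos:pos + size]

for chunk in chunker(list(range(10)), 4):
    print(chunk)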
Using little functions and things really doesn't appeal to me; I prefer to just use slices:
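A slices-only sketch of what the missing snippet likely looked like:

ints = list(range(1, 11))
size = 4
for i in range(0, len(ints), size):
    print(ints[i:i + size])   # plain slicing, no helpers, no imports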
Using map() instead of zip() fixes the padding issue in J.F. Sebastian's answer:
Example:
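A sketch; this relies on Python 2's map(None, ...), which pads short groups with None (it is a TypeError on Python 3):

# Python 2 only
s = '1234567890'
print(map(None, *[iter(s)] * 4))
# [('1', '2', '3', '4'), ('5', '6', '7', '8'), ('9', '0', None, None)]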
One-liner, ad hoc solution to iterate over a list x in chunks of size 4:
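The one-liner itself is missing; the standard slicing comprehension fits the description:

x = list(range(10))
print([x[i:i + 4] for i in range(0, len(x), 4)])
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]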
To avoid all conversions to a list, import itertools and:
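A sketch of the groupby-plus-count idiom the text describes (the exact shape is my assumption):

import itertools

counter = itertools.count()
for key, g in itertools.groupby(range(1, 10), key=lambda _: next(counter) // 4):
    print(key, list(g))    # list() here only to display the lazy group

Produces:

0 [1, 2, 3, 4]
1 [5, 6, 7, 8]
2 [9]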
I checked groupby and it doesn't convert to a list or use len, so I (think) this will delay resolution of each value until it is actually used. Sadly, none of the available answers (at this time) seemed to offer this variation.

Obviously, if you need to handle each item in turn, nest a for loop over g:
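Something like this (sketch):

import itertools

counter = itertools.count()
for key, g in itertools.groupby(range(1, 10), key=lambda _: next(counter) // 4):
    for item in g:          # each value is resolved only when reached
        print(key, item)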
My specific interest in this was the need to consume a generator to submit changes in batches of up to 1000 to the gmail API:
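A purely hypothetical sketch of that shape; change_generator and submit_batch are placeholder names, and no real Gmail API call is shown:

import itertools

def batches(iterable, n):
    counter = itertools.count()
    for _, g in itertools.groupby(iterable, key=lambda _: next(counter) // n):
        yield g

for batch in batches(change_generator(), 1000):   # hypothetical generator
    submit_batch(list(batch))                     # hypothetical API wrapper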
Unless I missed something, the following simple solution with generator expressions has not been mentioned. It assumes that both the size and the number of chunks are known (which is often the case), and that no padding is required:
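A sketch with illustrative numbers:

data = list(range(12))
chunk_size, n_chunks = 4, 3
chunks = (data[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks))
print(list(chunks))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]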
In your second method, I would advance to the next group of 4 by doing this:
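Presumably something like a re-slice (the exact missing line is an assumption):

ints = ints[4:]   # drop the four values just processed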
However, I haven't done any performance measurements, so I don't know which one might be more efficient.
Having said that, I would usually choose the first method. It's not pretty, but that's often a consequence of interfacing with the outside world.
With NumPy it's simple:
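The code and its output are missing; a reshape-based sketch (the length must be an exact multiple of the chunk size):

import numpy as np

ints = [1, 2, 3, 4, 5, 6, 7, 8]
for a, b, c, d in np.array(ints).reshape(-1, 4):
    print(a, b, c, d)

output:

1 2 3 4
5 6 7 8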
I never want my chunks padded, so that requirement is essential. I find that the ability to work on any iterable is also a requirement. Given that, I decided to extend the accepted answer, https://stackoverflow.com/a/434411/1074659.
Performance takes a slight hit in this approach, due to the need to compare and filter out the padded values when padding is not wanted. However, for large chunk sizes, this utility is very performant.
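A sketch of the described extension: pad with a unique sentinel via the zip_longest recipe, then filter the sentinel back out:

from itertools import zip_longest

_marker = object()   # unique sentinel that can never appear in real data

def chunker(iterable, n):
    for group in zip_longest(*[iter(iterable)] * n, fillvalue=_marker):
        if group[-1] is _marker:
            group = tuple(item for item in group if item is not _marker)
        yield group

print(list(chunker(range(1, 10), 4)))
# [(1, 2, 3, 4), (5, 6, 7, 8), (9,)]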
Yet another answer, the advantages of which are (see the sketch after the list):
1) Easily understandable
2) Works on any iterable, not just sequences (some of the above answers will choke on filehandles)
3) Does not load the chunk into memory all at once
4) Does not make a chunk-long list of references to the same iterator in memory
5) No padding of fill values at the end of the list
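The implementation is missing; a sketch consistent with points 1-5, yielding each chunk as a lazy sub-iterator (names chosen to match the discussion below, where ii is the inner iterator):

from itertools import chain, islice

def chunks(iterable, n):
    it = iter(iterable)
    while True:
        inner = islice(it, n)
        try:
            first = next(inner)           # detect exhaustion cleanly
        except StopIteration:
            return
        yield chain((first,), inner)      # lazy chunk; nothing buffered

for ii in chunks(range(1, 10), 4):
    for c in ii:
        print(c, end=' ')
    print()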
That being said, I haven't timed it so it might be slower than some of the more clever methods, and some of the advantages may be irrelevant given the use case.
Update:
A couple of drawbacks arise from the fact that the inner and outer loops are pulling values from the same iterator:
1) continue doesn't work as expected in the outer loop - it just continues on to the next item rather than skipping a chunk. However, this doesn't seem like a problem as there's nothing to test in the outer loop.
2) break doesn't work as expected in the inner loop - control will wind up in the inner loop again with the next item in the iterator. To skip whole chunks, either wrap the inner iterator (ii above) in a tuple, e.g.
for c in tuple(ii)
, or set a flag and exhaust the iterator.
You can use the partition or chunks functions from the funcy library:
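A sketch of usage (my recollection of funcy's semantics: partition keeps only complete chunks, while chunks keeps the partial tail; treat that as an assumption):

from funcy import chunks, partition

print(partition(4, range(1, 11)))   # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(chunks(4, range(1, 11)))      # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]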
These functions also have iterator versions, ipartition and ichunks, which will be more efficient in this case.

You can also peek at their implementation.
About the solution given by J.F. Sebastian here: it's clever, but it has one disadvantage - it always returns tuples. How do you get strings instead?
Of course you can write ''.join(chunker(...)), but the temporary tuple is constructed anyway.

You can get rid of the temporary tuple by writing your own zip, like this:
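The original code is missing; a sketch of the idea, a zip variant that feeds each group through a caller-supplied reductor (for example ''.join) instead of always building a tuple:

def zip_reduce(*iterables, reductor=tuple):
    iterators = [iter(it) for it in iterables]
    while True:
        group = []
        for it in iterators:
            try:
                group.append(next(it))
            except StopIteration:
                return                  # stop at the shortest iterable
        yield reductor(group)

Then the chunker becomes:

def chunker(iterable, n, reductor=tuple):
    return zip_reduce(*[iter(iterable)] * n, reductor=reductor)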
Example usage:
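For instance (illustrative):

for chunk in chunker('1234567890', 4, reductor=''.join):
    print(chunk)

# 1234
# 5678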
I like this approach. It feels simple and not magical and supports all iterable types and doesn't require imports.
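The snippet is missing; a no-import sketch matching that description:

def chunker(iterable, size):
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

for chunk in chunker(range(1, 10), 4):
    print(chunk)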
Quite Pythonic here (you may also inline the body of the split_groups function):
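A sketch; split_groups is the name used by the answer, its body is my assumption:

def split_groups(iterable, n):
    return zip(*[iter(iterable)] * n)

for first, second, third, fourth in split_groups(range(1, 9), 4):
    print(first, second, third, fourth)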
Works with any sequence:
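The snippet is missing; a slicing-based sketch that works with any sequence and preserves the input type (strings stay strings):

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

print(list(chunker('abcdefghi', 4)))      # ['abcd', 'efgh', 'i']
print(list(chunker([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]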
Modified from the Recipes section of Python's itertools docs:
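The code is missing; the recipe it was modified from looks like this, with an example:

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

Example

print(list(grouper('ABCDEFG', 3, 'x')))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]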
Note: on Python 2, use izip_longest instead of zip_longest.