What is the most efficient way to loop through dataframes with pandas?
I want to perform my own complex operations on financial data in dataframes in a sequential manner.
For example, I am using the following MSFT CSV file taken from Yahoo Finance:
Date,Open,High,Low,Close,Volume,Adj Close
2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13
2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31
2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98
2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27
....
I then do the following:
#!/usr/bin/env python
from pandas import *

# index by Date so that df.index[i] really is the date
df = read_csv('table.csv', index_col='Date')

for i, row in enumerate(df.values):
    date = df.index[i]
    open, high, low, close, volume, adjclose = row
    # now perform analysis on open/close based on date, etc.
Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a manner that also retrieves the index (possibly through a generator to be memory efficient)? Unfortunately, df.iteritems only iterates column by column.
13 Answers
The newest versions of pandas now include a built-in function, iterrows, for iterating over rows. Or, if you want it faster, use itertuples(). But unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.
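The answer's original snippets are not reproduced above; a minimal sketch of both calls, assuming the question's table.csv indexed by date:

import pandas as pd

df = pd.read_csv('table.csv', index_col='Date')

# iterrows yields (index, Series) pairs -- convenient but slow
for date, row in df.iterrows():
    print(date, row['Open'], row['Close'])

# itertuples yields namedtuples -- much faster
for row in df.itertuples():
    print(row.Index, row.Open, row.Close)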
Pandas is based on NumPy arrays. The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.

For example, if close is a 1-d array and you want the day-over-day percent change, a single vectorized expression computes the entire array of percent changes as one statement, instead of appending one value at a time inside a loop.

So try to avoid the Python loop for i, row in enumerate(...) entirely, and think about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.
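The answer's own one-liner is not shown above; a sketch of the vectorized expression versus the loop it replaces, assuming close is a NumPy array of closing prices:

import numpy as np

close = np.array([27.13, 27.31, 26.98, 27.27])

# vectorized: the whole array of day-over-day percent changes in one statement
pct_change = close[1:] / close[:-1] - 1

# the equivalent Python loop, item by item (much slower on real data)
pct_change_loop = []
for i in range(1, len(close)):
    pct_change_loop.append(close[i] / close[i - 1] - 1)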
Like what has been mentioned before, a pandas object is most efficient when you process the whole array at once. However, for those who, like me, really need to loop through a pandas DataFrame to perform something, I found at least three ways to do it. I did a short test to see which one of the three is the least time-consuming.

Result:

This is probably not the best way to measure time consumption, but it's quick for me.

Here are some pros and cons, IMHO:

EDIT 2020/11/10

For what it is worth, here is an updated benchmark with some other alternatives (perf with a MacBook Pro, 2.4 GHz Intel Core i9, 8 cores, 32 GB 2667 MHz DDR4).
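Neither the benchmark code nor its numbers survived the copy; a sketch of how three iteration methods could be timed (the choice of iterrows, itertuples, and zip here is my assumption, not necessarily the answer's):

import time
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.rand(100_000), 'B': np.random.rand(100_000)})

def time_it(label, fn):
    start = time.time()
    fn()
    print(label, round(time.time() - start, 3), 's')

time_it('iterrows  ', lambda: [row['A'] + row['B'] for _, row in df.iterrows()])
time_it('itertuples', lambda: [row.A + row.B for row in df.itertuples()])
time_it('zip       ', lambda: [a + b for a, b in zip(df['A'], df['B'])])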
You can loop through the rows by transposing and then calling iteritems.

I am not certain about efficiency in that case. To get the best possible performance in an iterative algorithm, you might want to explore writing it in Cython.

I would recommend writing the algorithm in pure Python first, making sure it works and seeing how fast it is; if it's not fast enough, convert things to Cython with minimal work to get something that's about as fast as hand-coded C/C++.
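The transpose snippet itself is not shown above; a minimal sketch of the idea (note that iteritems has since been renamed items in current pandas):

import pandas as pd

df = pd.read_csv('table.csv', index_col='Date')

# transposing makes each original row a column, so iterating column-wise
# now yields one (date, Series-of-that-row) pair per original row
for date, row in df.T.items():
    print(date, row['Open'], row['Close'])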
You have three options (a combined sketch follows the source link below).

By index (simplest):

With iterrows (most used):

With itertuples (fastest):

The three options display something like:

Source: alphons.io
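The per-option snippets above did not survive the copy; a combined sketch of the three options against the question's table.csv:

import pandas as pd

df = pd.read_csv('table.csv')

# 1) By index (simplest)
for i in df.index:
    print(df['Date'][i], df['Close'][i])

# 2) With iterrows (most used)
for index, row in df.iterrows():
    print(row['Date'], row['Close'])

# 3) With itertuples (fastest)
for row in df.itertuples():
    print(row.Date, row.Close)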
After noticing Nick Crawford's answer, I checked out iterrows, but found that it yields (index, Series) tuples. Not sure which would work best for you, but I ended up using the itertuples method for my problem, which yields (index, row_value1, ...) tuples. There's also iterkv, which iterates through (column, series) tuples.
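A quick sketch of the shapes mentioned (iterkv has long been removed; items() is used below as its modern stand-in, which is my assumption):

import pandas as pd

df = pd.DataFrame({'Open': [27.37, 26.94], 'Close': [27.13, 27.31]})

print(next(df.iterrows()))    # (0, Series with Open/Close of row 0)
print(next(df.itertuples()))  # Pandas(Index=0, Open=27.37, Close=27.13)
print(next(df.items()))       # ('Open', Series of the whole Open column)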
Just as a small addition, you can also do an apply if you have a complex function that you apply to a single column:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html
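The example that followed was not carried over; a sketch of applying a function to one column (the column name and function here are placeholders):

import pandas as pd

df = pd.read_csv('table.csv')

def my_analysis(close):
    # placeholder for whatever per-value logic you need
    return close * 1.05

df['Adjusted'] = df['Close'].apply(my_analysis)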
As @joris pointed out, iterrows is much slower than itertuples: itertuples is approximately 100 times faster than iterrows. I tested the speed of both methods on a DataFrame with 5 million records; the result was about 1,200 it/s for iterrows and about 120,000 it/s for itertuples.

If you use itertuples, note that every element in the for loop is a namedtuple, so to get the value in each column you can refer to the following example code.
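The referenced example was not carried over; a minimal sketch of reading namedtuple fields from itertuples, assuming the question's columns:

import pandas as pd

df = pd.read_csv('table.csv')

for row in df.itertuples():
    # each column becomes an attribute; the DataFrame index is row.Index
    print(row.Index, row.Date, row.Open, row.Close)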
For sure, the fastest way to iterate over a dataframe is to access the underlying numpy ndarray, either via df.values (as you do) or by accessing each column separately as df.column_name.values. Since you want to have access to the index too, you can use df.index.values for that.

Not pythonic? Sure. But fast.

If you want to squeeze more juice out of the loop you will want to look into Cython. Cython will let you gain huge speedups (think 10x-100x). For maximum performance, check out memory views for Cython.
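A sketch of that column-array pattern against the question's columns (my own illustration, not the answer's original code):

import pandas as pd

df = pd.read_csv('table.csv', index_col='Date')

dates = df.index.values      # underlying index array
opens = df['Open'].values    # one plain NumPy array per column of interest
closes = df['Close'].values

for i in range(len(df)):
    # scalar NumPy access inside the loop -- no per-row pandas overhead
    print(dates[i], closes[i] - opens[i])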
Another suggestion would be to combine groupby with vectorized calculations if subsets of the rows share characteristics that allow you to do so.
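A sketch of the idea, grouping the question's rows by year (the grouping key is a made-up example):

import pandas as pd

df = pd.read_csv('table.csv', parse_dates=['Date'])

# one vectorized calculation per group instead of one per row
yearly_mean_close = df.groupby(df['Date'].dt.year)['Close'].mean()
print(yearly_mean_close)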
Look at the last one.
I believe the most simple and efficient way to loop through DataFrames is using numpy and numba. In that case, looping can be approximately as fast as vectorized operations in many cases. If numba is not an option, plain numpy is likely to be the next best option. As has been noted many times, your default should be vectorization, but this answer merely considers efficient looping, given the decision to loop, for whatever reason.

For a test case, let's use the example from @DSM's answer of calculating a percentage change. This is a very simple situation and, as a practical matter, you would not write a loop to calculate it, but as such it provides a reasonable baseline for timing vectorized approaches vs loops.

Let's set up the 4 approaches with a small DataFrame, and we'll time them on a larger dataset below.

And here are the timings on a DataFrame with 100,000 rows (timings performed with Jupyter's %timeit function, collapsed to a summary table for readability):

Summary: for simple cases, like this one, you would go with (vectorized) pandas for simplicity and readability, and (vectorized) numpy for speed. If you really need to use a loop, do it in numpy. If numba is available, combine it with numpy for additional speed. In this case, numpy + numba is almost as fast as vectorized numpy code.

Other details:
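The setup code, the timing table, and the other details did not survive the copy; a sketch of four approaches along the lines the answer describes (the exact function bodies are my assumption):

import numba
import numpy as np
import pandas as pd

df = pd.DataFrame({'close': np.random.rand(100_000) + 1})

def pct_change_pandas(df):
    return df['close'].pct_change()          # vectorized pandas

def pct_change_numpy(close):
    return close[1:] / close[:-1] - 1        # vectorized numpy

def pct_change_numpy_loop(close):
    out = np.empty(len(close) - 1)
    for i in range(1, len(close)):
        out[i - 1] = close[i] / close[i - 1] - 1
    return out

@numba.njit
def pct_change_numba_loop(close):            # same loop, JIT-compiled
    out = np.empty(len(close) - 1)
    for i in range(1, len(close)):
        out[i - 1] = close[i] / close[i - 1] - 1
    return out

Each would then be timed with %timeit, passing df for the pandas version and df['close'].values for the numpy/numba versions.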
There absolutely is a most-efficient way: vectorization. After that comes list comprehension, followed by itertuples(). Stay away from iterrows(). It's pretty horrible, coming in much slower even than a raw for loop with regular df["A"][i]-type indexing.

I present 13 ways in great detail here, speed testing them all and showing all of the code: How to iterate over Pandas DataFrames [with and without] iterating. I spent several weeks writing that answer. Here are the results:

Key takeaways:

Vectorization becomes harder to read and write when you need if statements in the formula you are calculating for each row, however.

Functions like iterrows() are horribly slow, at ~600x slower than pure vectorization.

To prove that all 13 techniques I speed tested are possible even in complicated formulas, I chose this non-trivial formula to calculate via all of the techniques, where A, B, C, and D are columns, and the i subscripts are rows (ex: i-2 is 2 rows up, i-1 is the previous row, i is the current row, i+1 is the next row, etc.):

For a ton more detail, and the code for all 13 techniques, refer to my main answer: How to iterate over Pandas DataFrames without iterating.
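None of the 13 techniques' code is reproduced here; as one small illustration, a sketch of the list-comprehension style the answer ranks just behind vectorization (the computed column is a made-up example):

import pandas as pd

df = pd.read_csv('table.csv')

# build a new column with a list comprehension over zipped column values
df['Range'] = [high - low for high, low in zip(df['High'], df['Low'])]
print(df[['Date', 'Range']].head())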