What is the most efficient way to loop over a DataFrame with pandas?

Posted 2024-12-10 21:06:33

I want to perform my own complex operations on financial data in dataframes in a sequential manner.

For example I am using the following MSFT CSV file taken from Yahoo Finance:

Date,Open,High,Low,Close,Volume,Adj Close
2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13
2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31
2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98
2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27

....

I then do the following:

#!/usr/bin/env python
import pandas as pd

df = pd.read_csv('table.csv', index_col='Date')

for i, row in enumerate(df.values):
    date = df.index[i]
    open_, high, low, close, volume, adjclose = row  # avoid shadowing the open() builtin
    # now perform analysis on open/close based on date, etc.

Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a way that also retrieves the index (possibly through a generator, for memory efficiency)? Unfortunately, df.iteritems only iterates column by column.

13 Answers

一紙繁鸢 2024-12-17 21:06:33

The newest versions of pandas now include a built-in function for iterating over rows.

for index, row in df.iterrows():

    # do some logic here

Or, if you want it faster, use itertuples().

But, unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.
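
For example, a minimal sketch of both loops (toy data and column names chosen purely for illustration):

import pandas as pd

df = pd.DataFrame({'Open': [27.37, 26.94], 'Close': [27.13, 27.31]})

# iterrows yields (index, Series) pairs -- convenient but slow
for index, row in df.iterrows():
    print(index, row['Open'], row['Close'])

# itertuples yields namedtuples -- noticeably faster
for row in df.itertuples():
    print(row.Index, row.Open, row.Close)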

伴梦长久 2024-12-17 21:06:33

Pandas is based on NumPy arrays.
The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.

For example, if close is a 1-d array, and you want the day-over-day percent change,

pct_change = close[1:] / close[:-1] - 1   # whole-array ratios, minus 1 to get the fractional change

This computes the entire array of percent changes as one statement, instead of

pct_change = []
for row in close:
    pct_change.append(...)

So try to avoid the Python loop for i, row in enumerate(...) entirely, and
think about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.
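
The same whole-array idea expressed in pandas (a sketch; Series.pct_change computes the equivalent in one vectorized call):

import pandas as pd

close = pd.Series([27.27, 26.98, 27.31, 27.13])
pct_change = close.pct_change()  # NaN for the first element, then (x[i] - x[i-1]) / x[i-1]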

画中仙 2024-12-17 21:06:33

As has been mentioned before, pandas objects are most efficient when processing the whole array at once. However, for those who really need to loop through a pandas DataFrame to perform something, like me, I found at least three ways to do it. I ran a short test to see which of the three is the least time-consuming.

import time
import pandas as pd

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i, r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(time.time() - A)

C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))
B.append(time.time() - A)

C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(time.time() - A)

print(B)

Result:

[0.5639059543609619, 0.017839908599853516, 0.005645036697387695]

This is probably not the best way to measure time consumption, but it was quick for me.

Here are some pros and cons IMHO:

  • .iterrows(): returns the index and row items in separate variables, but is significantly slower
  • .itertuples(): faster than .iterrows(), but returns the index together with the row items; ir[0] is the index
  • zip: fastest, but no direct access to the row's index (though you can zip the index in explicitly, as sketched below)
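
If you need the index with zip, you can include it explicitly (a small sketch on the same toy frame):

import pandas as pd

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})

# zip the index in alongside the columns to keep both the speed and the index
for idx, a, b in zip(t.index, t['a'], t['b']):
    pass  # use idx, a, b here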

EDIT 2020/11/10

For what it is worth, here is an updated benchmark with some other alternatives (measured on a MacBook Pro, 2.4 GHz Intel Core i9, 8 cores, 32 GB 2667 MHz DDR4):

import sys
import tqdm
import time
import pandas as pd

B = []
t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
for _ in tqdm.tqdm(range(10)):
    C = []
    A = time.time()
    for i,r in t.iterrows():
        C.append((r['a'], r['b']))
    B.append({"method": "iterrows", "time": time.time()-A})

    C = []
    A = time.time()
    for ir in t.itertuples():
        C.append((ir[1], ir[2]))
    B.append({"method": "itertuples", "time": time.time()-A})

    C = []
    A = time.time()
    for r in zip(t['a'], t['b']):
        C.append((r[0], r[1]))
    B.append({"method": "zip", "time": time.time()-A})

    C = []
    A = time.time()
    for r in zip(*t.to_dict("list").values()):
        C.append((r[0], r[1]))
    B.append({"method": "zip + to_dict('list')", "time": time.time()-A})

    C = []
    A = time.time()
    for r in t.to_dict("records"):
        C.append((r["a"], r["b"]))
    B.append({"method": "to_dict('records')", "time": time.time()-A})

    A = time.time()
    t.agg(tuple, axis=1).tolist()
    B.append({"method": "agg", "time": time.time()-A})

    A = time.time()
    t.apply(tuple, axis=1).tolist()
    B.append({"method": "apply", "time": time.time()-A})

print(f'Python {sys.version} on {sys.platform}')
print(f"Pandas version {pd.__version__}")
print(
    pd.DataFrame(B).groupby("method").agg(["mean", "std"]).xs("time", axis=1).sort_values("mean")
)

## Output

Python 3.7.9 (default, Oct 13 2020, 10:58:24) 
[Clang 12.0.0 (clang-1200.0.32.2)] on darwin
Pandas version 1.1.4
                           mean       std
method                                   
zip + to_dict('list')  0.002353  0.000168
zip                    0.003381  0.000250
itertuples             0.007659  0.000728
to_dict('records')     0.025838  0.001458
agg                    0.066391  0.007044
apply                  0.067753  0.006997
iterrows               0.647215  0.019600

梦境 2024-12-17 21:06:33

You can loop through the rows by transposing and then calling iteritems:

for date, row in df.T.iteritems():
   # do some logic here

I am not certain about efficiency in that case. To get the best possible performance in an iterative algorithm, you might want to explore writing it in Cython, so you could do something like:

from numpy cimport ndarray, float64_t

def my_algo(ndarray[object] dates, ndarray[float64_t] open,
            ndarray[float64_t] low, ndarray[float64_t] high,
            ndarray[float64_t] close, ndarray[float64_t] volume):
    cdef:
        Py_ssize_t i, n
        float64_t foo
    n = len(dates)

    for i in range(n):  # range loops compile down to plain C loops in Cython
        foo = close[i] - open[i]  # will be extremely fast

I would recommend writing the algorithm in pure Python first, making sure it works and seeing how fast it is. If it's not fast enough, convert things to Cython like this with minimal work, to get something about as fast as hand-coded C/C++.
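
For illustration, here is a sketch of how such a compiled function might be called from Python; the module name my_algo_mod is hypothetical and depends on how you build the .pyx file:

# assuming the Cython code above was compiled into a module named my_algo_mod
from my_algo_mod import my_algo

my_algo(df.index.values.astype(object),
        df['Open'].values, df['Low'].values, df['High'].values,
        df['Close'].values, df['Volume'].values)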

恍梦境° 2024-12-17 21:06:33

You have three options:

By index (simplest):

>>> for index in df.index:
...     print ("df[" + str(index) + "]['B']=" + str(df['B'][index]))

With iterrows (most used):

>>> for index, row in df.iterrows():
...     print ("df[" + str(index) + "]['B']=" + str(row['B']))

With itertuples (fastest):

>>> for row in df.itertuples():
...     print ("df[" + str(row.Index) + "]['B']=" + str(row.B))

All three options display something like:

df[0]['B']=125
df[1]['B']=415
df[2]['B']=23
df[3]['B']=456
df[4]['B']=189
df[5]['B']=456
df[6]['B']=12

Source: alphons.io

苏别ゝ 2024-12-17 21:06:33

I checked out iterrows after noticing Nick Crawford's answer, but found that it yields (index, Series) tuples. Not sure which would work best for you, but I ended up using the itertuples method for my problem, which yields (index, row_value1...) tuples.

There's also iterkv, which iterates through (column, series) tuples.
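
A quick illustration of the two shapes (toy data with a hypothetical column a):

import pandas as pd

df = pd.DataFrame({'a': [1, 2]})

next(df.iterrows())    # -> (0, <Series with a=1>), an (index, Series) pair
next(df.itertuples())  # -> Pandas(Index=0, a=1), a namedtuple that includes the index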

你的呼吸 2024-12-17 21:06:33

Just as a small addition, you can also do an apply if you have a complex function that you apply to a single column:

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html

df['b'] = df['a'].apply(lambda x: x ** 2)  # apply passes each element; replace x ** 2 with your own logic
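
Note that for simple element-wise arithmetic, a direct vectorized expression is usually faster than apply (a sketch, same hypothetical columns):

df['b'] = df['a'] ** 2  # vectorized equivalent of the apply call above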

攀登最高峰 2024-12-17 21:06:33

As @joris pointed out, iterrows is much slower than itertuples: roughly 100 times slower. I tested the speed of both methods on a DataFrame with 5 million records; the result was about 1,200 it/s for iterrows and about 120,000 it/s for itertuples.

If you use itertuples, note that every element in the for loop is a namedtuple, so to get the value in each column you can refer to the following example code:

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
...                   index=['a', 'b'])
>>> df
   col1  col2
a     1   0.1
b     2   0.2
>>> for row in df.itertuples():
...     print(row.col1, row.col2)
...
1 0.1
2 0.2

流绪微梦 2024-12-17 21:06:33

For sure, the fastest way to iterate over a dataframe is to access the underlying numpy ndarray, either via df.values (as you do) or by accessing each column separately with df.column_name.values. Since you want access to the index too, you can use df.index.values for that.

index = df.index.values
column_of_interest_1 = df.column_name_1.values
...
column_of_interest_k = df.column_name_k.values

for i in range(df.shape[0]):
    index_value = index[i]
    ...
    column_value_k = column_of_interest_k[i]

Not pythonic? Sure. But fast.

If you want to squeeze more juice out of the loop, you will want to look into Cython. Cython will let you gain huge speedups (think 10x-100x). For maximum performance, check out memory views for Cython.

伴随着你 2024-12-17 21:06:33

Another suggestion would be to combine groupby with vectorized calculations, if subsets of the rows share characteristics that allow you to do so (as sketched below).
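
A minimal sketch of that idea, assuming a hypothetical ticker column that groups the rows:

import pandas as pd

df = pd.DataFrame({
    'ticker': ['MSFT', 'MSFT', 'AAPL', 'AAPL'],
    'close':  [27.13, 27.31, 390.0, 392.5],
})

# one vectorized calculation per group instead of a Python loop over rows
df['pct_change'] = df.groupby('ticker')['close'].pct_change()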

樱花坊 2024-12-17 21:06:33

Look at the last one:

import time
import pandas as pd

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i, r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(round(time.time() - A, 5))

C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))
B.append(round(time.time() - A, 5))

C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(round(time.time() - A, 5))

C = []
A = time.time()
for r in range(len(t)):  # relies on the default RangeIndex
    C.append((t.loc[r, 'a'], t.loc[r, 'b']))
B.append(round(time.time() - A, 5))

C = []
A = time.time()
# list comprehension over zipped columns -- the fastest variant here
[C.append((x, y)) for x, y in zip(t['a'], t['b'])]
B.append(round(time.time() - A, 5))
B

0.46424
0.00505
0.00245
0.09879
0.00209

埋情葬爱 2024-12-17 21:06:33

I believe the simplest and most efficient way to loop through DataFrames is using numpy and numba. In that case, looping can in many cases be approximately as fast as vectorized operations. If numba is not an option, plain numpy is likely to be the next best option. As has been noted many times, your default should be vectorization, but this answer merely considers efficient looping, given the decision to loop, for whatever reason.

For a test case, let's use the example from @DSM's answer of calculating a percentage change. This is a very simple situation and as a practical matter you would not write a loop to calculate it, but as such it provides a reasonable baseline for timing vectorized approaches vs loops.

Let's set up the 4 approaches with a small DataFrame, and we'll time them on a larger dataset below.

import pandas as pd
import numpy as np
import numba as nb

df = pd.DataFrame({'close': [100, 105, 95, 105]})

pandas_vectorized = df.close.pct_change()[1:]

x = df.close.to_numpy()
numpy_vectorized = (x[1:] - x[:-1]) / x[:-1]

def test_numpy(x):
    pct_chng = np.zeros(len(x))
    for i in range(1, len(x)):
        pct_chng[i] = (x[i] - x[i-1]) / x[i-1]
    return pct_chng

numpy_loop = test_numpy(df.close.to_numpy())[1:]

@nb.jit(nopython=True)
def test_numba(x):
    pct_chng = np.zeros(len(x))
    for i in range(1, len(x)):
        pct_chng[i] = (x[i] - x[i-1]) / x[i-1]
    return pct_chng

numba_loop = test_numba(df.close.to_numpy())[1:]

And here are the timings on a DataFrame with 100,000 rows (timings performed with Jupyter's %timeit function, collapsed to a summary table for readability):

pandas/vectorized    1,130 microseconds
numpy/vectorized       382 microseconds
numpy/looped        72,800 microseconds
numba/looped           455 microseconds

Summary: for simple cases, like this one, you would go with (vectorized) pandas for simplicity and readability, and (vectorized) numpy for speed. If you really need to use a loop, do it in numpy. If numba is available, combine it with numpy for additional speed. In this case, numpy + numba is almost as fast as vectorized numpy code.

Other details:

  • Not shown are various options like iterrows, itertuples, etc. which are orders of magnitude slower and really should never be used.
  • The timings here are fairly typical: numpy is faster than pandas and vectorized is faster than loops, but adding numba to numpy will often speed numpy up dramatically.
  • Everything except the pandas option requires converting the DataFrame column to a numpy array. That conversion is included in the timings.
  • The time to define/compile the numpy/numba functions was not included in the timings, but would generally be a negligible component of the timing for any large dataframe.

゛清羽墨安 2024-12-17 21:06:33

Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values...

There absolutely is a most-efficient way: vectorization. After that comes list comprehension, followed by itertuples(). Stay away from iterrows(): it's pretty horrible, even slower than a raw for loop with regular df["A"][i]-style indexing.

I present 13 ways in great detail here, speed testing them all, and showing all of the code: How to iterate over Pandas DataFrames [with and without] iterating.

I spent several weeks writing that answer. Here are the results:

[Chart: speed comparison of the 13 iteration techniques]

Key takeaways:

  1. If your goal is code that is easy to read, write, and maintain, while still being really fast, use list comprehension. It is only ~10x slower than pure vectorization.
  2. If your goal is code that is as fast as possible, use pure vectorization. However, it is harder to read and write when the formula you are calculating for each row involves complicated logic such as if statements.

Functions like iterrows() are horribly slow, at ~600x slower than pure vectorization.
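
For reference, the list-comprehension pattern from takeaway 1 looks like this (a sketch using the column names from the formula below):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# zip the columns once, then build the result list in a single pass
df['C'] = [a + b if a > 1 else b for a, b in zip(df['A'], df['B'])]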

To prove that all 13 techniques I speed tested are possible even in complicated formulas, I chose this non-trivial formula to calculate via all of the techniques, where A, B, C, and D are columns, and the i subscripts are rows (ex: i-2 is 2 rows up, i-1 is the previous row, i is the current row, i+1 is the next row, etc.):

[Image: the non-trivial formula used in the benchmark]

For a ton more detail, and the code for all 13 techniques, refer to my main answer: How to iterate over Pandas DataFrames without iterating.
