使用 pandas 循环数据帧的最有效方法是什么？

发布于 2024-12-10 21:06:33 字数 796 浏览 0 评论 0原文

我想以顺序方式对数据帧中的财务数据执行我自己的复杂操作。

例如，我使用以下来自 Yahoo Finance 的 MSFT CSV 文件：

Date,Open,High,Low,Close,Volume,Adj Close
2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13
2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31
2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98
2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27

....

操作：

#!/usr/bin/env python
from pandas import *

df = read_csv('table.csv')

for i, row in enumerate(df.values):
    date = df.index[i]
    open, high, low, close, adjclose = row
    #now perform analysis on open/close based on date, etc..

然后执行以下那是最有效的方法？考虑到 pandas 对速度的关注，我认为必须有一些特殊的函数来以一种还检索索引的方式迭代这些值（可能通过生成器以提高内存效率）？不幸的是，df.iteritems 只能逐列迭代。

原文

I want to perform my own complex operations on financial data in dataframes in a sequential manner.

For example I am using the following MSFT CSV file taken from Yahoo Finance:

Date,Open,High,Low,Close,Volume,Adj Close
2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13
2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31
2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98
2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27

....

I then do the following:

#!/usr/bin/env python
from pandas import *

df = read_csv('table.csv')

for i, row in enumerate(df.values):
    date = df.index[i]
    open, high, low, close, adjclose = row
    #now perform analysis on open/close based on date, etc..

Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a manner that one also retrieves the index (possibly through a generator to be memory efficient)? df.iteritems unfortunately only iterates column by column.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

一紙繁鸢 2024-12-17 21:06:33

最新版本的 pandas 现在包含一个用于迭代行的内置函数。

for index, row in df.iterrows():

    # do some logic here

或者，如果您希望速度更快，请使用 itertuples()

但是，unutbu 建议使用 numpy 函数来避免迭代行，这将产生最快的代码。

The newest versions of pandas now include a built-in function for iterating over rows.

for index, row in df.iterrows():

    # do some logic here

Or, if you want it faster use itertuples()

But, unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.

回复收藏 0 原文

伴梦长久 2024-12-17 21:06:33

Pandas 基于 NumPy 数组。
提高 NumPy 数组速度的关键是立即对整个数组执行操作，而不是逐行或逐项执行操作。

例如，如果 close 是一维数组，并且您想要每日百分比变化，

pct_change = close[1:]/close[:-1]

这会将百分比变化的整个数组计算为一个语句，而不是

pct_change = []
for row in close:
    pct_change.append(...)

因此尝试避免Python 循环 for i, row in enumerate(...) 完全，并且
考虑如何通过对整个数组（或数据帧）作为一个整体而不是逐行进行操作来执行计算。

Pandas is based on NumPy arrays.
The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.

For example, if close is a 1-d array, and you want the day-over-day percent change,

pct_change = close[1:]/close[:-1]

This computes the entire array of percent changes as one statement, instead of

pct_change = []
for row in close:
    pct_change.append(...)

So try to avoid the Python loop for i, row in enumerate(...) entirely, and
think about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.

回复收藏 0 原文

画中仙 2024-12-17 21:06:33

就像之前提到的，pandas 对象在一次处理整个数组时效率最高。然而，对于那些真正需要循环 Pandas DataFrame 来执行某些操作的人（比如我），我发现了至少三种方法来做到这一点。我做了一个简短的测试，看看这三个中哪一个最耗时。

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i,r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(time.time()-A)

C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))    
B.append(time.time()-A)

C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(time.time()-A)

print B

结果：

[0.5639059543609619, 0.017839908599853516, 0.005645036697387695]

这可能不是衡量时间消耗的最佳方法，但对我来说很快。

以下是一些优点和缺点恕我直言：

.iterrows()：在单独的变量中返回索引和行项，但速度明显慢
.itertuples()：比 .iterrows() 更快，但返回索引和行项， ir[0] 是索引
zip：最快，但无法访问该行的索引

编辑 2020/11/10

对于它的价值，这里是一个更新的基准和一些其他替代方案（perf 与MacBookPro 2.4 GHz Intel Core i9 8 核 32 Go 2667 MHz DDR4)

import sys
import tqdm
import time
import pandas as pd

B = []
t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
for _ in tqdm.tqdm(range(10)):
    C = []
    A = time.time()
    for i,r in t.iterrows():
        C.append((r['a'], r['b']))
    B.append({"method": "iterrows", "time": time.time()-A})

    C = []
    A = time.time()
    for ir in t.itertuples():
        C.append((ir[1], ir[2]))
    B.append({"method": "itertuples", "time": time.time()-A})

    C = []
    A = time.time()
    for r in zip(t['a'], t['b']):
        C.append((r[0], r[1]))
    B.append({"method": "zip", "time": time.time()-A})

    C = []
    A = time.time()
    for r in zip(*t.to_dict("list").values()):
        C.append((r[0], r[1]))
    B.append({"method": "zip + to_dict('list')", "time": time.time()-A})

    C = []
    A = time.time()
    for r in t.to_dict("records"):
        C.append((r["a"], r["b"]))
    B.append({"method": "to_dict('records')", "time": time.time()-A})

    A = time.time()
    t.agg(tuple, axis=1).tolist()
    B.append({"method": "agg", "time": time.time()-A})

    A = time.time()
    t.apply(tuple, axis=1).tolist()
    B.append({"method": "apply", "time": time.time()-A})

print(f'Python {sys.version} on {sys.platform}')
print(f"Pandas version {pd.__version__}")
print(
    pd.DataFrame(B).groupby("method").agg(["mean", "std"]).xs("time", axis=1).sort_values("mean")
)

## Output

Python 3.7.9 (default, Oct 13 2020, 10:58:24) 
[Clang 12.0.0 (clang-1200.0.32.2)] on darwin
Pandas version 1.1.4
                           mean       std
method                                   
zip + to_dict('list')  0.002353  0.000168
zip                    0.003381  0.000250
itertuples             0.007659  0.000728
to_dict('records')     0.025838  0.001458
agg                    0.066391  0.007044
apply                  0.067753  0.006997
iterrows               0.647215  0.019600

Like what has been mentioned before, pandas object is most efficient when process the whole array at once. However for those who really need to loop through a pandas DataFrame to perform something, like me, I found at least three ways to do it. I have done a short test to see which one of the three is the least time consuming.

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i,r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(time.time()-A)

C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))    
B.append(time.time()-A)

C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(time.time()-A)

print B

Result:

[0.5639059543609619, 0.017839908599853516, 0.005645036697387695]

This is probably not the best way to measure the time consumption but it's quick for me.

Here are some pros and cons IMHO:

.iterrows(): return index and row items in separate variables, but significantly slower
.itertuples(): faster than .iterrows(), but return index together with row items, ir[0] is the index
zip: quickest, but no access to index of the row

EDIT 2020/11/10

For what it is worth, here is an updated benchmark with some other alternatives (perf with MacBookPro 2,4 GHz Intel Core i9 8 cores 32 Go 2667 MHz DDR4)

import sys
import tqdm
import time
import pandas as pd

B = []
t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
for _ in tqdm.tqdm(range(10)):
    C = []
    A = time.time()
    for i,r in t.iterrows():
        C.append((r['a'], r['b']))
    B.append({"method": "iterrows", "time": time.time()-A})

    C = []
    A = time.time()
    for ir in t.itertuples():
        C.append((ir[1], ir[2]))
    B.append({"method": "itertuples", "time": time.time()-A})

    C = []
    A = time.time()
    for r in zip(t['a'], t['b']):
        C.append((r[0], r[1]))
    B.append({"method": "zip", "time": time.time()-A})

    C = []
    A = time.time()
    for r in zip(*t.to_dict("list").values()):
        C.append((r[0], r[1]))
    B.append({"method": "zip + to_dict('list')", "time": time.time()-A})

    C = []
    A = time.time()
    for r in t.to_dict("records"):
        C.append((r["a"], r["b"]))
    B.append({"method": "to_dict('records')", "time": time.time()-A})

    A = time.time()
    t.agg(tuple, axis=1).tolist()
    B.append({"method": "agg", "time": time.time()-A})

    A = time.time()
    t.apply(tuple, axis=1).tolist()
    B.append({"method": "apply", "time": time.time()-A})

print(f'Python {sys.version} on {sys.platform}')
print(f"Pandas version {pd.__version__}")
print(
    pd.DataFrame(B).groupby("method").agg(["mean", "std"]).xs("time", axis=1).sort_values("mean")
)

## Output

Python 3.7.9 (default, Oct 13 2020, 10:58:24) 
[Clang 12.0.0 (clang-1200.0.32.2)] on darwin
Pandas version 1.1.4
                           mean       std
method                                   
zip + to_dict('list')  0.002353  0.000168
zip                    0.003381  0.000250
itertuples             0.007659  0.000728
to_dict('records')     0.025838  0.001458
agg                    0.066391  0.007044
apply                  0.067753  0.006997
iterrows               0.647215  0.019600

回复收藏 0 原文

梦境 2024-12-17 21:06:33

您可以通过转置然后调用 iteritems 来循环遍历行：

for date, row in df.T.iteritems():
   # do some logic here

我不确定这种情况下的效率。为了在迭代算法中获得最佳性能，您可能想要探索在 Cython 中编写它，因此您可以执行以下操作：

def my_algo(ndarray[object] dates, ndarray[float64_t] open,
            ndarray[float64_t] low, ndarray[float64_t] high,
            ndarray[float64_t] close, ndarray[float64_t] volume):
    cdef:
        Py_ssize_t i, n
        float64_t foo
    n = len(dates)

    for i from 0 <= i < n:
        foo = close[i] - open[i] # will be extremely fast

我建议编写首先使用纯 Python 编写算法，确保它有效并查看它有多快 - 如果不够快，可以像这样以最少的工作将内容转换为 Cython，以获得与手动编码 C/C++ 一样快的内容。

You can loop through the rows by transposing and then calling iteritems:

for date, row in df.T.iteritems():
   # do some logic here

I am not certain about efficiency in that case. To get the best possible performance in an iterative algorithm, you might want to explore writing it in Cython, so you could do something like:

def my_algo(ndarray[object] dates, ndarray[float64_t] open,
            ndarray[float64_t] low, ndarray[float64_t] high,
            ndarray[float64_t] close, ndarray[float64_t] volume):
    cdef:
        Py_ssize_t i, n
        float64_t foo
    n = len(dates)

    for i from 0 <= i < n:
        foo = close[i] - open[i] # will be extremely fast

I would recommend writing the algorithm in pure Python first, make sure it works and see how fast it is-- if it's not fast enough, convert things to Cython like this with minimal work to get something that's about as fast as hand-coded C/C++.

回复收藏 0 原文

恍梦境° 2024-12-17 21:06:33

您有三个选择：

通过索引（最简单）：

>>> for index in df.index:
...     print ("df[" + str(index) + "]['B']=" + str(df['B'][index]))

使用 iterrows（最常用）：

>>> for index, row in df.iterrows():
...     print ("df[" + str(index) + "]['B']=" + str(row['B']))

使用 itertuples （最快）：

>>> for row in df.itertuples():
...     print ("df[" + str(row.Index) + "]['B']=" + str(row.B))

三个选项显示类似：

df[0]['B']=125
df[1]['B']=415
df[2]['B']=23
df[3]['B']=456
df[4]['B']=189
df[5]['B']=456
df[6]['B']=12

来源：alphons.io

You have three options:

By index (simplest):

>>> for index in df.index:
...     print ("df[" + str(index) + "]['B']=" + str(df['B'][index]))

With iterrows (most used):

>>> for index, row in df.iterrows():
...     print ("df[" + str(index) + "]['B']=" + str(row['B']))

With itertuples (fastest):

>>> for row in df.itertuples():
...     print ("df[" + str(row.Index) + "]['B']=" + str(row.B))

Three options display something like:

df[0]['B']=125
df[1]['B']=415
df[2]['B']=23
df[3]['B']=456
df[4]['B']=189
df[5]['B']=456
df[6]['B']=12

Source: alphons.io

回复收藏 0 原文

苏别ゝ 2024-12-17 21:06:33

在注意到Nick Crawford的答案后，我查看了iterrows，但发现它产生了(index ，系列）元组。不确定哪种方法最适合您，但我最终使用了 itertuples 方法来解决我的问题，该方法产生 (index, row_value1...) 元组。

还有 iterkv，它迭代（列、系列）元组。

回复收藏 0 原文

你的呼吸 2024-12-17 21:06:33

作为一个小补充，如果您有一个应用于单个列的复杂函数，您也可以执行 apply ：

http://pandas.pydata.org/pandas-docs/dev/ generated/pandas.DataFrame.apply.html

df[b] = df[a].apply(lambda col: do stuff with col here)

Just as a small addition, you can also do an apply if you have a complex function that you apply to a single column:

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html

df[b] = df[a].apply(lambda col: do stuff with col here)

回复收藏 0 原文

攀登最高峰 2024-12-17 21:06:33

正如 @joris 指出的，iterrows 比 itertuples 慢得多并且 itertuples 比 iterrows 快大约 100 倍，我在具有 500 万条记录的 DataFrame 中测试了这两种方法的速度，结果是iterrows 为 1200it/s，itertuples 为 120000it/s。

如果使用itertuples，请注意，for循环中的每个元素都是一个namedtuple，因此要获取每一列中的值，可以参考以下示例代码

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
                      index=['a', 'b'])
>>> df
   col1  col2
a     1   0.1
b     2   0.2
>>> for row in df.itertuples():
...     print(row.col1, row.col2)
...
1, 0.1
2, 0.2

As @joris pointed out, iterrows is much slower than itertuples and itertuples is approximately 100 times faster than iterrows, and I tested the speed of both methods in a DataFrame with 5 million records the result is for iterrows, it is 1200it/s, and itertuples is 120000it/s.

If you use itertuples, note that every element in the for loop is a namedtuple, so to get the value in each column, you can refer to the following example code

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
                      index=['a', 'b'])
>>> df
   col1  col2
a     1   0.1
b     2   0.2
>>> for row in df.itertuples():
...     print(row.col1, row.col2)
...
1, 0.1
2, 0.2

回复收藏 0 原文

流绪微梦 2024-12-17 21:06:33

当然，迭代数据帧的最快方法是通过 df.values 访问底层 numpy ndarray （正如您所做的那样）或单独访问每列 df.column_name.values /代码>。由于您也想访问索引，因此可以使用 df.index.values 来实现。

index = df.index.values
column_of_interest1 = df.column_name1.values
...
column_of_interestk = df.column_namek.values

for i in range(df.shape[0]):
   index_value = index[i]
   ...
   column_value_k = column_of_interest_k[i]

不是蟒蛇式的吗？当然。但速度很快。

如果您想从循环中榨取更多汁液，您需要查看 cython。 Cython 将使您获得巨大的加速（想想 10 倍到 100 倍）。为了获得最佳性能，请检查 cython 的内存视图。

For sure, the fastest way to iterate over a dataframe is to access the underlying numpy ndarray either via df.values (as you do) or by accessing each column separately df.column_name.values. Since you want to have access to the index too, you can use df.index.values for that.

index = df.index.values
column_of_interest1 = df.column_name1.values
...
column_of_interestk = df.column_namek.values

for i in range(df.shape[0]):
   index_value = index[i]
   ...
   column_value_k = column_of_interest_k[i]

Not pythonic? Sure. But fast.

If you want to squeeze more juice out of the loop you will want to look into cython. Cython will let you gain huge speedups (think 10x-100x). For maximum performance check memory views for cython.

回复收藏 0 原文

伴随着你 2024-12-17 21:06:33

另一个建议是将 groupby 与矢量化计算结合起来，如果行的子集共享允许您这样做的特征。

回复收藏 0 原文

樱花坊 2024-12-17 21:06:33

看最后一张

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i,r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(round(time.time()-A,5))

C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))    
B.append(round(time.time()-A,5))

C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(round(time.time()-A,5))

C = []
A = time.time()
for r in range(len(t)):
    C.append((t.loc[r, 'a'], t.loc[r, 'b']))
B.append(round(time.time()-A,5))

C = []
A = time.time()
[C.append((x,y)) for x,y in zip(t['a'], t['b'])]
B.append(round(time.time()-A,5))
B

0.46424
0.00505
0.00245
0.09879
0.00209

look at last one

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i,r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(round(time.time()-A,5))

C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))    
B.append(round(time.time()-A,5))

C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(round(time.time()-A,5))

C = []
A = time.time()
for r in range(len(t)):
    C.append((t.loc[r, 'a'], t.loc[r, 'b']))
B.append(round(time.time()-A,5))

C = []
A = time.time()
[C.append((x,y)) for x,y in zip(t['a'], t['b'])]
B.append(round(time.time()-A,5))
B

0.46424
0.00505
0.00245
0.09879
0.00209

回复收藏 0 原文

埋情葬爱 2024-12-17 21:06:33

我相信循环 DataFrames 最简单、最有效的方法是使用 numpy 和 numba。在这种情况下，在许多情况下，循环的速度大约与矢量化操作一样快。如果 numba 不是一个选择，那么普通 numpy 可能是下一个最佳选择。正如多次指出的，您的默认值应该是矢量化，但这个答案仅考虑有效循环，无论出于何种原因决定循环。

对于测试用例，让我们使用 @DSM 的答案中计算百分比变化的示例。这是一个非常简单的情况，实际上，您不会编写循环来计算它，但因此它为定时矢量化方法与循环提供了合理的基线。

让我们用一个小的 DataFrame 设置 4 种方法，然后我们将在下面的更大的数据集上对它们进行计时。

import pandas as pd
import numpy as np
import numba as nb

df = pd.DataFrame( { 'close':[100,105,95,105] } )

pandas_vectorized = df.close.pct_change()[1:]

x = df.close.to_numpy()
numpy_vectorized = ( x[1:] - x[:-1] ) / x[:-1]
        
def test_numpy(x):
    pct_chng = np.zeros(len(x))
    for i in range(1,len(x)):
        pct_chng[i] = ( x[i] - x[i-1] ) / x[i-1]
    return pct_chng

numpy_loop = test_numpy(df.close.to_numpy())[1:]

@nb.jit(nopython=True)
def test_numba(x):
    pct_chng = np.zeros(len(x))
    for i in range(1,len(x)):
        pct_chng[i] = ( x[i] - x[i-1] ) / x[i-1]
    return pct_chng
    
numba_loop = test_numba(df.close.to_numpy())[1:]

以下是具有 100,000 行的 DataFrame 上的计时（使用 Jupyter 的 %timeit 函数执行的计时，为了便于阅读而折叠到汇总表中）：

pandas/vectorized   1,130 micro-seconds
numpy/vectorized      382 micro-seconds
numpy/looped       72,800 micro-seconds
numba/looped          455 micro-seconds

摘要：对于像这样的简单情况，您可以使用（矢量化）pandas 是为了简单性和可读性，（矢量化）numpy 是为了速度。如果您确实需要使用循环，请在 numpy 中执行。如果 numba 可用，请将其与 numpy 结合使用以提高速度。在这种情况下，numpy + numba 几乎与矢量化 numpy 代码一样快。

其他细节：

未显示各种选项，如 iterrows、itertuples 等，它们的速度要慢几个数量级，并且永远不应该使用。
这里的时间相当典型：numpy 比 pandas 快，矢量化比循环快，但将 numba 添加到 numpy 通常会显着加快 numpy 的速度。
除了 pandas 选项之外的所有内容都需要将 DataFrame 列转换为 numpy 数组。该转换包含在计时中。
定义/编译 numpy/numba 函数的时间不包括在计时中，但通常是任何大型数据帧的计时中可以忽略不计的组成部分。

I believe the most simple and efficient way to loop through DataFrames is using numpy and numba. In that case, looping can be approximately as fast as vectorized operations in many cases. If numba is not an option, plain numpy is likely to be the next best option. As has been noted many times, your default should be vectorization, but this answer merely considers efficient looping, given the decision to loop, for whatever reason.

For a test case, let's use the example from @DSM's answer of calculating a percentage change. This is a very simple situation and as a practical matter you would not write a loop to calculate it, but as such it provides a reasonable baseline for timing vectorized approaches vs loops.

Let's set up the 4 approaches with a small DataFrame, and we'll time them on a larger dataset below.

import pandas as pd
import numpy as np
import numba as nb

df = pd.DataFrame( { 'close':[100,105,95,105] } )

pandas_vectorized = df.close.pct_change()[1:]

x = df.close.to_numpy()
numpy_vectorized = ( x[1:] - x[:-1] ) / x[:-1]
        
def test_numpy(x):
    pct_chng = np.zeros(len(x))
    for i in range(1,len(x)):
        pct_chng[i] = ( x[i] - x[i-1] ) / x[i-1]
    return pct_chng

numpy_loop = test_numpy(df.close.to_numpy())[1:]

@nb.jit(nopython=True)
def test_numba(x):
    pct_chng = np.zeros(len(x))
    for i in range(1,len(x)):
        pct_chng[i] = ( x[i] - x[i-1] ) / x[i-1]
    return pct_chng
    
numba_loop = test_numba(df.close.to_numpy())[1:]

And here are the timings on a DataFrame with 100,000 rows (timings performed with Jupyter's %timeit function, collapsed to a summary table for readability):

pandas/vectorized   1,130 micro-seconds
numpy/vectorized      382 micro-seconds
numpy/looped       72,800 micro-seconds
numba/looped          455 micro-seconds

Summary: for simple cases, like this one, you would go with (vectorized) pandas for simplicity and readability, and (vectorized) numpy for speed. If you really need to use a loop, do it in numpy. If numba is available, combine it with numpy for additional speed. In this case, numpy + numba is almost as fast as vectorized numpy code.

Other details:

Not shown are various options like iterrows, itertuples, etc. which are orders of magnitude slower and really should never be used.
The timings here are fairly typical: numpy is faster than pandas and vectorized is faster than loops, but adding numba to numpy will often speed numpy up dramatically.
Everything except the pandas option requires converting the DataFrame column to a numpy array. That conversion is included in the timings.
The time to define/compile the numpy/numba functions was not included in the timings, but would generally be a negligible component of the timing for any large dataframe.

回复收藏 0 原文