Fastest way to multiply multiple columns in a DataFrame based on a condition

Posted 2025-01-17 19:12:56

import pandas as pd

data = [{'a': 12, 'b': 23, 'c': 34, 'd': 0.1, 'e': 25},
        {'a': 13, 'b': 26, 'c': 38, 'd': 0.02, 'e': 26},
        {'a': 19, 'b': 28, 'c': 31, 'd': 0.04, 'e': 22}
       ]

# Create the DataFrame.
df = pd.DataFrame(data)
print(df)

     a   b   c    d     e
0   12  23  34  0.10    25
1   13  26  38  0.02    26
2   19  28  31  0.04    22

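I have a very large DataFrame with 20 columns and more than 20 million rows, and I would like to multiply certain columns by column d.

For example, in this case I want to multiply columns a, c, and e by the percentage in column d. What is the fastest way to do this?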

初相遇 2025-01-24 19:12:56

If you select the columns with a list of column names, DataFrame.mul is fast:

cols = ['a','c','e']
df[cols] = df[cols].mul(df['d'], axis=0)
print(df)
      a   b     c     d     e
0  1.20  23  3.40  0.10  2.50
1  0.26  26  0.76  0.02  0.52
2  0.76  28  1.24  0.04  0.88
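
The `axis=0` argument is what makes the multiplication run row by row. The following is a minimal sketch, using the sample data from the question, of how the result differs when `axis=0` is left out; the names `by_rows` and `by_cols` are illustrative and not part of the original answer.

```python
import pandas as pd

data = [
    {'a': 12, 'b': 23, 'c': 34, 'd': 0.1, 'e': 25},
    {'a': 13, 'b': 26, 'c': 38, 'd': 0.02, 'e': 26},
    {'a': 19, 'b': 28, 'c': 31, 'd': 0.04, 'e': 22},
]
df = pd.DataFrame(data)
cols = ['a', 'c', 'e']

# axis=0 aligns the Series index with the row index,
# so every value in row i is multiplied by d[i].
by_rows = df[cols].mul(df['d'], axis=0)

# With the default axis ('columns'), the Series index (0, 1, 2) is aligned
# with the column labels ('a', 'c', 'e'). No labels match, so the result
# contains only NaN.
by_cols = df[cols].mul(df['d'])

print(by_rows)
print(by_cols.isna().all().all())
```

Because the alignment is by label, the same call should also work when the row index is not a simple 0..n-1 range, as long as the frame and the Series share that index.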

NumPy alternative, but it is not faster:

cols = ['a','c','e']
df[cols] = df[cols].to_numpy() * df['d'].to_numpy()[:, None]
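
The `[:, None]` indexing in the NumPy version reshapes the one-dimensional `d` array into a column so that broadcasting applies it across each row. A small sketch of that step in isolation, with plain arrays holding the a, c, e and d values from the question (the variable names are illustrative):

```python
import numpy as np

# Values of columns a, c, e from the sample data, one row per record.
values = np.array([[12, 34, 25],
                   [13, 38, 26],
                   [19, 31, 22]], dtype=float)
d = np.array([0.1, 0.02, 0.04])

# d has shape (3,). Indexing with [:, None] adds a trailing axis, giving
# shape (3, 1), which broadcasts against the (3, 3) array row by row.
d_col = d[:, None]
scaled = values * d_col

print(d.shape, d_col.shape, scaled.shape)
print(scaled)
```

Without the extra axis, `values * d` would broadcast `d` along the last axis instead, scaling each column rather than each row.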

df = pd.DataFrame(data)
# 300k rows
df = pd.concat([df] * 100000, ignore_index=True)
print(df)


In [113]: %%timeit
     ...: cols = ['a','c','e']
     ...: df[cols] = df[cols].mul(df['d'], axis=0)
     ...: 
     ...: 
14.5 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [114]: %%timeit
     ...: cols = ['a','c','e']
     ...: df[cols] = df[cols].to_numpy() * df['d'].to_numpy()[:, None]
     ...: 
138 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
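
The timings above come from an IPython session. To repeat the comparison as a plain script, something along these lines could be used; it is a sketch that assumes the standard-library `timeit` module in place of `%%timeit`, and the helper names `with_mul` and `with_numpy` are made up for illustration. Unlike the cells above, it times only the computation and does not assign the result back to `df`, so the numbers are not directly comparable, and they will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

data = [
    {'a': 12, 'b': 23, 'c': 34, 'd': 0.1, 'e': 25},
    {'a': 13, 'b': 26, 'c': 38, 'd': 0.02, 'e': 26},
    {'a': 19, 'b': 28, 'c': 31, 'd': 0.04, 'e': 22},
]
# 300k rows, built the same way as in the answer.
df = pd.concat([pd.DataFrame(data)] * 100_000, ignore_index=True)
cols = ['a', 'c', 'e']


def with_mul(frame):
    # pandas version: row-wise multiplication via DataFrame.mul.
    return frame[cols].mul(frame['d'], axis=0)


def with_numpy(frame):
    # NumPy version: broadcast d as a column over the selected values.
    return frame[cols].to_numpy() * frame['d'].to_numpy()[:, None]


# Check that both approaches agree before timing them.
same = np.allclose(with_mul(df).to_numpy(), with_numpy(df))

runs = 5
t_mul = min(timeit.repeat(lambda: with_mul(df), number=runs, repeat=3)) / runs
t_np = min(timeit.repeat(lambda: with_numpy(df), number=runs, repeat=3)) / runs

print(f"results match: {same}")
print(f"DataFrame.mul: {t_mul * 1e3:.2f} ms per call")
print(f"NumPy:         {t_np * 1e3:.2f} ms per call")
```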
