What is an efficient way to group by, filter, and count occurrences of a particular value within each group in pandas?
Hi, how's it going? I have a giant dataframe and am trying to do a groupby, filter, and then count the occurrences of a particular event within each group. The code I have works, but it doesn't scale well at all; it takes forever to run. Can someone help me with a fast way to perform the same computation? Below is what I have so far, reproduced in a dummy example:
import pandas as pd

dates = ['2012-03-30', '2012-03-30', '2012-03-30', '2012-03-30', '2012-03-30',
         '2012-03-31', '2012-03-31', '2012-03-31', '2012-03-31', '2012-03-31']
person = ['dave', 'mike', 'mike', 'dave', 'mike', 'dave', 'dave', 'dave', 'mike', 'mike']
weather = ['rainy', 'sunny', 'cloudy', 'cloudy', 'rainy', 'sunny', 'cloudy', 'sunny', 'cloudy', 'rainy']
events = ['sneeze', 'cough', 'sneeze', 'sneeze', 'cough', 'cough', 'sneeze', 'cough', 'sneeze', 'sneeze']
df = pd.DataFrame({'date': dates, 'person': person, 'weather': weather, 'event': events})

def sneeze_by_weather(df):
    # Count the rows in this group where the event is a sneeze
    num_sneeze = df[df['event'] == 'sneeze'].shape[0]
    if num_sneeze == 0:
        return 0
    else:
        return num_sneeze

df_transformed = (
    df.groupby(['date', 'person', 'weather'])
      .apply(lambda x: sneeze_by_weather(x))
      .reset_index()
)
Is there any way to perform this computation much faster so that it scales when I have millions of rows?
3 Answers
%%timeit result for your provided dataframe with my code below:
The slowest run took 4.03 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 5: 495 µs per loop
This should be faster:
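A minimal sketch of the vectorized idea, assuming the approach is to turn the sneeze test into a boolean column and sum it per group (the sneeze column name is my choice, and df is the question's dataframe); the output below is what this sketch produces on the dummy data:

df_transformed = (
    df.assign(sneeze=df['event'].eq('sneeze'))   # True where the event is a sneeze
      .groupby(['date', 'person', 'weather'], as_index=False)['sneeze']
      .sum()                                     # summing booleans counts the sneezes
)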
Output:
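         date person weather  sneeze
0  2012-03-30   dave  cloudy       1
1  2012-03-30   dave   rainy       1
2  2012-03-30   mike  cloudy       1
3  2012-03-30   mike   rainy       0
4  2012-03-30   mike   sunny       0
5  2012-03-31   dave  cloudy       1
6  2012-03-31   dave   sunny       0
7  2012-03-31   mike  cloudy       1
8  2012-03-31   mike   rainy       1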
Another option is to use merge:
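One plausible reading of the merge idea (the helper names counts and groups are mine): count sneezes in the filtered frame, then left-merge the counts back onto the distinct (date, person, weather) combinations so that groups with no sneezes keep a count of 0.

# Count sneezes per group in the filtered frame
counts = (
    df[df['event'] == 'sneeze']
      .groupby(['date', 'person', 'weather'])
      .size()
      .reset_index(name='sneeze')
)
# Left-merge onto every observed group so zero-sneeze groups survive
groups = df[['date', 'person', 'weather']].drop_duplicates()
df_transformed = groups.merge(counts, on=['date', 'person', 'weather'], how='left')
df_transformed['sneeze'] = df_transformed['sneeze'].fillna(0).astype(int)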
I was overcomplicating... actually, this way is much more straightforward:
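As a guess at the simpler version being referred to, the whole thing can be one chained expression that groups a boolean Series by the three key columns:

df_transformed = (
    df['event'].eq('sneeze')                               # boolean Series, True for sneezes
      .groupby([df['date'], df['person'], df['weather']])  # group by the key columns
      .sum()
      .reset_index(name='sneeze')
)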