Is there a faster way to do a Pandas groupby cumulative mean?
I am trying to create a lookup reference table in Python that calculates the cumulative mean of a Player's previous (by `datetime`) game scores, grouped by venue. However, for my specific need, a player should have previously played a minimum of 2 times at the relevant Venue for a 'Venue Preference' cumulative mean calculation.

The `df` format looks like the following:
| DateTime | Player | Venue | Score |
|---|---|---|---|
| 2021-09-25 17:15:00 | Tim | Stadium A | 20 |
| 2021-09-27 10:00:00 | Blake | Stadium B | 30 |
My existing code works perfectly but is unfortunately very slow:
```python
import numpy as np
import pandas as pd

# Per-game score total, then a running total per (Player, Venue) in date order
VenueSum = pd.DataFrame(df.groupby(['DateTime', 'Player', 'Venue'])['Score'].sum().reset_index(name='Sum'))
VenueSum['Cumulative Sum'] = VenueSum.sort_values('DateTime').groupby(['Player', 'Venue'])['Sum'].cumsum()

# Per-game score count, then a running count per (Player, Venue) in date order
VenueCount = pd.DataFrame(df.groupby(['DateTime', 'Player', 'Venue'])['Score'].count().reset_index(name='Count'))
VenueCount['Cumulative Count'] = VenueCount.sort_values('DateTime').groupby(['Player', 'Venue'])['Count'].cumsum()

# Combine both and keep the mean only once the running count reaches 2
VenueLookup = VenueSum.merge(VenueCount, how='outer', on=['DateTime', 'Player', 'Venue'])
VenueLookup['Venue Preference'] = np.where(VenueLookup['Cumulative Count'] >= 2, VenueLookup['Cumulative Sum'] / VenueLookup['Cumulative Count'], np.nan)
VenueLookup = VenueLookup.drop(['Sum', 'Cumulative Sum', 'Count', 'Cumulative Count'], axis=1)
```
I am sure there is a way to calculate the cumulative mean in one step without first calculating the cumulative sum and cumulative count, but unfortunately I couldn't get that to work.
IIUC, you can drop two of the `groupby` calls: aggregate by `sum` and `size` first, then take the cumulative sum of both columns.
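The answer's own code did not survive on this page, so the following is only a minimal sketch of how that suggestion could look, not the answerer's exact solution. The sample rows beyond the two shown in the question are hypothetical, added so that the 2-game threshold is actually reached. Note that `size` counts every row while the question's `count` skips missing scores, so the two only agree when `Score` has no NaN values.

```python
import pandas as pd

# Hypothetical sample data in the shape described in the question.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2021-09-25 17:15:00",
        "2021-09-27 10:00:00",
        "2021-09-28 12:00:00",
        "2021-09-29 09:30:00",
        "2021-09-30 18:00:00",
    ]),
    "Player": ["Tim", "Blake", "Tim", "Tim", "Blake"],
    "Venue": ["Stadium A", "Stadium B", "Stadium A", "Stadium A", "Stadium B"],
    "Score": [20, 30, 40, 60, 50],
})

# One aggregation pass: score total and row count per (DateTime, Player, Venue).
agg = (
    df.groupby(["DateTime", "Player", "Venue"])["Score"]
      .agg(["sum", "size"])
      .reset_index()
      .sort_values("DateTime")
)

# One cumulative pass over both columns at once, per (Player, Venue).
cum = agg.groupby(["Player", "Venue"])[["sum", "size"]].cumsum()

# Keep the running mean only where the running count has reached 2.
agg["Venue Preference"] = (cum["sum"] / cum["size"]).where(cum["size"] >= 2)

VenueLookup = agg.drop(columns=["sum", "size"])
print(VenueLookup)
```

This replaces the four `groupby` calls and the `merge` in the question with two `groupby` calls and no merge, which is where any speed-up would come from; how much faster it is on real data would need to be measured.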