How can I calculate a conditional aggregate measure for each row of a dataframe?
I have a table like this...
Date | PlayerId | Goals |
---|---|---|
June 1 | A | 1 |
June 14 | A | 1 |
June 15 | B | 2 |
June 28 | A | 1 |
July 6 | B | 0 |
July 17 | A | 1 |
I would like to calculate the number of goals a player scored in the prior 30 days (NOT the prior 30 games). The final result should look like...
Date | PlayerId | Goals | Goals_Prev_30 |
---|---|---|---|
June 1 | A | 1 | 0 |
June 14 | A | 1 | 1 |
June 15 | B | 2 | 0 |
June 28 | A | 1 | 2 |
July 6 | B | 0 | 2 |
July 17 | A | 1 | 1 |
I created a for loop that identifies a single row of the dataframe, filters the dataframe by that row's characteristics, sums the goals in the filtered dataframe, and appends the total to a list, which is finally assigned to the Goals_Prev_30 column. The code looks like...
```python
goals_prev_30 = []
for i in range(len(df)):
    row = df.iloc[i]
    filtered_df = df[
        (df['Date'] < row['Date'])
        & (df['Date'] >= row['Date'] - pd.to_timedelta(30, unit='d'))
        & (df['PlayerId'] == row['PlayerId'])
    ]
    total = filtered_df['Goals'].sum()
    goals_prev_30.append(total)
df['Goals_Prev_30'] = goals_prev_30
```
This solution works, but it's slow: it processes around 30 rows per second, which isn't viable since I have several similar measures and over 1.2M rows. At that rate each measure would take about 11 hours to compute.
How can this problem be solved in a more efficient manner?
I changed your solution to a custom function applied per group, with a mask created by broadcasting, summing the values of the `Goals` column within each group where the mask matches: