How to use the Pandas groupby function to calculate the previous year's average?

Posted 2025-01-11 17:25:45

I am trying to look for a method to find a Player's mean score for the "Last Season" (Previous Year) and add it in a new column in the original dataframe df.

I have coded a formula to get a Player's mean score for the current year, excluding the current row, which is as follows:

df['Season Avg'] = (
    df.groupby([df['Player'], df['DateTime'].dt.year])['Score']
      .transform(lambda x: x.shift(1).expanding().mean())
)

However, despite my best attempts with the shift function, I cannot work out how to calculate the previous year's mean ("Last Season Avg") directly into a new column.

The dataframe is set out as follows:

Player   DateTime             Score  Season Avg
PlayerB  2020-MM-DD HH:MM:SS     40         NaN
PlayerA  2020-MM-DD HH:MM:SS     50         NaN
PlayerA  2021-MM-DD HH:MM:SS    100         NaN
PlayerB  2021-MM-DD HH:MM:SS    200         NaN
PlayerA  2021-MM-DD HH:MM:SS    160         100
PlayerB  2021-MM-DD HH:MM:SS    140         200
PlayerB  2021-MM-DD HH:MM:SS    160         170
PlayerA  2021-MM-DD HH:MM:SS    200         130

The new ideal dataframe that I would like:

Player   DateTime             Score  Season Avg  Last Season Avg
PlayerB  2020-MM-DD HH:MM:SS     40         NaN              NaN
PlayerA  2020-MM-DD HH:MM:SS     50         NaN              NaN
PlayerA  2021-MM-DD HH:MM:SS    100         NaN               50
PlayerB  2021-MM-DD HH:MM:SS    200         NaN               40
PlayerA  2021-MM-DD HH:MM:SS    160         100               50
PlayerB  2021-MM-DD HH:MM:SS    140         200               40
PlayerB  2021-MM-DD HH:MM:SS    160         170               40
PlayerA  2021-MM-DD HH:MM:SS    200         130               50

Comments (2)

醉酒的小男人 2025-01-18 17:25:45

You can group by "Player" and the year once to find each player's yearly average, then group by "Player" and shift to get the previous year's average.

out = df.groupby(['Player', df['DateTime'].dt.year])['Score'].mean().reset_index(name='Season Avg')
out['Last Season Avg'] = out.groupby('Player')['Season Avg'].shift()
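
To get those per-season values back onto the original rows, something like the following should work. This is only a sketch: the dates are invented to stand in for the question's placeholder timestamps, and the `Year` helper column is my own naming, not something from the question.

```python
import pandas as pd

# Made-up data mirroring the question's layout (real dates instead of placeholders).
df = pd.DataFrame({
    "Player": ["PlayerB", "PlayerA", "PlayerA", "PlayerB",
               "PlayerA", "PlayerB", "PlayerB", "PlayerA"],
    "DateTime": pd.to_datetime([
        "2020-03-01", "2020-03-02", "2021-03-01", "2021-03-02",
        "2021-04-01", "2021-04-02", "2021-05-01", "2021-05-02",
    ]),
    "Score": [40, 50, 100, 200, 160, 140, 160, 200],
})

# Hypothetical helper: the season key, named so it survives reset_index().
year = df["DateTime"].dt.year.rename("Year")

# One row per (Player, Year) with that season's mean score.
out = df.groupby(["Player", year])["Score"].mean().reset_index(name="Season Avg")

# Within each player, shift the season means down by one season.
out["Last Season Avg"] = out.groupby("Player")["Season Avg"].shift()

# Left-merge so every original row picks up its player's previous-season mean.
result = df.assign(Year=year).merge(
    out[["Player", "Year", "Last Season Avg"]],
    on=["Player", "Year"],
    how="left",
)
print(result)
```

The left merge keeps the original row order, so `result["Last Season Avg"]` can be assigned straight back to `df` if you prefer not to keep the extra `Year` column.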

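If you're looking for career averages up to a particular season, you could use expanding().mean() on the season averages (note that this weights every season equally):

out = df.groupby(['Player', df['DateTime'].dt.year])['Score'].mean().reset_index(name='Season Avg')
out['Career Avg by Season'] = out.groupby('Player')['Season Avg'].transform(lambda s: s.expanding().mean())
out['Career Avg by Last Season'] = out.groupby('Player')['Career Avg by Season'].shift()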
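
One caveat: an expanding mean taken over per-season averages counts every season equally, however many games it contains. If you would rather weight by games played, one possible approach (my own variation, not part of the code above) is to keep running sums and counts per player. The data and the `Year` name below are invented for illustration.

```python
import pandas as pd

# Made-up data: one game in 2019, three in 2020, one in 2021.
df = pd.DataFrame({
    "Player": ["A", "A", "A", "A", "A"],
    "DateTime": pd.to_datetime(
        ["2019-01-01", "2020-01-01", "2020-02-01", "2020-03-01", "2021-01-01"]
    ),
    "Score": [10, 20, 30, 40, 50],
})

year = df["DateTime"].dt.year.rename("Year")

# Total score and number of games per (Player, Year).
season = df.groupby(["Player", year])["Score"].agg(["sum", "count"])

# Running totals within each player, in season order.
running = season.groupby(level="Player").cumsum()

# Career average through the end of each season, weighted by games played.
season["Career Avg by Season"] = running["sum"] / running["count"]

# Career average as it stood at the end of the previous season.
season["Career Avg by Last Season"] = (
    season.groupby(level="Player")["Career Avg by Season"].shift()
)
print(season)
```

Here the career average through 2020 comes out as 25.0 (100 points over 4 games), whereas averaging the two season means (10 and 30) would give 20.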

Edit:

With sample data provided, let's test it. The main problem here is that each year appears on several rows. @PaulRougieux handles it very elegantly. Here's another option: find last season's averages and map them back onto the original df (instead of transforming it). Since the sample DateTime values are placeholder strings, the year is taken with .str[:4].

df['Last Season Avg'] = (df.set_index(['Player', df['DateTime'].str[:4]]).index
                             .map(df.groupby(['Player', df['DateTime'].str[:4]])['Score'].mean()
                                  .groupby(level=0).shift()))

Output:

    Player             DateTime  Score  Season Avg  Last Season Avg
0  PlayerB  2020-MM-DD HH:MM:SS     40         NaN              NaN
1  PlayerA  2020-MM-DD HH:MM:SS     50         NaN              NaN
2  PlayerA  2021-MM-DD HH:MM:SS    100         NaN             50.0
3  PlayerB  2021-MM-DD HH:MM:SS    200         NaN             40.0
4  PlayerA  2021-MM-DD HH:MM:SS    160       100.0             50.0
5  PlayerB  2021-MM-DD HH:MM:SS    140       200.0             40.0
6  PlayerB  2021-MM-DD HH:MM:SS    160       170.0             40.0
7  PlayerA  2021-MM-DD HH:MM:SS    200       130.0             50.0

风追烟花雨 2025-01-18 17:25:45

Create a sample data set

import pandas
import numpy as np
df = pandas.DataFrame(
    {'player': ['B', 'A', 'A', 'B', 'A', 'B', 'B', 'A'],
     'datetime': ['2020-01-01', '2020-01-01', '2021-01-01', '2021-01-01',
                  '2021-01-01', '2021-01-01', '2021-01-01', '2021-01-01'],
     'score': [40, 50, 100, 200, 160, 140, 160, 200],
    }
)
df["datetime"] = pandas.to_datetime(df["datetime"])
df["year"] = df["datetime"].dt.year

Use transform to add the current season average to the data frame

df["season_avg"] = df.groupby(["year", "player"])["score"].transform("mean")
df

  player   datetime  score  year  season_avg
0      B 2020-01-01     40  2020   40.000000
1      A 2020-01-01     50  2020   50.000000
2      A 2021-01-01    100  2021  153.333333
3      B 2021-01-01    200  2021  166.666667
4      A 2021-01-01    160  2021  153.333333
5      B 2021-01-01    140  2021  166.666667
6      B 2021-01-01    160  2021  166.666667
7      A 2021-01-01    200  2021  153.333333

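Shift cannot be applied directly here: each year is repeated over several rows, so shifting by one row mostly returns a value from the same season rather than the previous one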

df.sort_values(["year"], ascending=True).groupby(["player"])["season_avg"].transform("shift")

0           NaN
1           NaN
2     50.000000
3     40.000000
4    153.333333
5    166.666667
6    166.666667
7    153.333333
Name: season_avg, dtype: float64

Compute the averages from the previous year and join them to the original dataframe

savg = (df.groupby(["year", "player"])
        .agg(last_season_avg = ("score", "mean"))
        .reset_index())
savg["year"] = savg["year"] + 1
savg

   year player  last_season_avg
0  2021      A        50.000000
1  2021      B        40.000000
2  2022      A       153.333333
3  2022      B       166.666667

df.merge(savg, on=["player", "year"], how="left" )

  player   datetime  score  year  season_avg  last_season_avg
0      B 2020-01-01     40  2020   40.000000              NaN
1      A 2020-01-01     50  2020   50.000000              NaN
2      A 2021-01-01    100  2021  153.333333             50.0
3      B 2021-01-01    200  2021  166.666667             40.0
4      A 2021-01-01    160  2021  153.333333             50.0
5      B 2021-01-01    140  2021  166.666667             40.0
6      B 2021-01-01    160  2021  166.666667             40.0
7      A 2021-01-01    200  2021  153.333333             50.0

Another way to compute the average from the previous year is to use shift, which is perhaps more elegant than doing year + 1.

savg = (df.groupby(["year", "player"])
        .agg(season_avg = ("score", "mean"))
        .reset_index()
        .sort_values(["year"])
       )
savg["last_season_avg"] = savg.groupby(["player"])["season_avg"].transform("shift")
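
The shifted table still has to be merged back onto df, same as before. It is also worth knowing that the two variants are not quite equivalent: shift picks up the previous season the player actually appears in, while year + 1 only matches the immediately preceding calendar year. A small sketch with made-up data (player A has no 2021 rows) shows the difference; `by_shift`, `by_year` and `prev` are names I picked for illustration.

```python
import pandas as pd

# Made-up data: player A skips 2021, player B plays every year.
df = pd.DataFrame({
    "player": ["A", "A", "B", "B", "B"],
    "year": [2020, 2022, 2020, 2021, 2022],
    "score": [50, 70, 40, 60, 80],
})

# Season averages, one row per (year, player), sorted so shift follows time.
savg = (df.groupby(["year", "player"])
          .agg(season_avg=("score", "mean"))
          .reset_index()
          .sort_values(["year"]))

# Variant 1: shift within each player, then merge back onto the rows.
savg["last_season_avg"] = savg.groupby("player")["season_avg"].shift()
by_shift = df.merge(savg[["year", "player", "last_season_avg"]],
                    on=["year", "player"], how="left")

# Variant 2: relabel each season average as belonging to year + 1, then merge.
prev = (savg[["year", "player", "season_avg"]]
        .rename(columns={"season_avg": "last_season_avg"}))
prev["year"] = prev["year"] + 1
by_year = df.merge(prev, on=["year", "player"], how="left")

print(by_shift)
print(by_year)
```

For A's 2022 row, the shift variant reports the 2020 average (50), while the year + 1 variant leaves it NaN because there is no 2021 season to match. Which one you want depends on whether "last season" means the previous calendar year or the last season the player actually played.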