如何计算由两个变量分组的过去n年的平均值

发布于 2025-01-14 19:11:00 字数 2316 浏览 1 评论 0原文

首先,我想说我查看了[这个答案][1],但我无法继续处理那里的信息。

所以我有一个这样的数据集,

df = pd.DataFrame({'ID': [10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013],
'Type': ['Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue'],
'Year': [2018,2018,2019,2019,2020,2020,2021,2021,2022,2022,2018,2018,2019,2019,2021,2021,2018,2018,2019,2019,2020,2020,2021,2021,2022,2022,2018,2018,2019,2019,2021,2021],
'Score': [0,0,0,0,0,0,0,0,0,0,14,24,16,5,87,33,0,0,0,0,0,0,0,0,0,0,11,13,3,16,37,49]})

我不知道如何输入 NaN,所以我这样做:

df.replace(0, np.nan, inplace=True)
df = df.dropna(axis=0, subset=['Score'])

我想按 ID 和主题获取滚动 N(在本例中为 3)年平均分。

我可以通过 ID 获取 3 年滚动平均分数:

df['average_past_3_years'] = df.groupby(['ID'], as_index = False).rolling(3).agg( {'分数':'平均值', '年份': '最大值'}).reset_index(level=0).groupby( '年份').transform('shift')['分数'] df = df.sort_values(['ID', 'Year'])

这给了我滚动平均值,但只是按 ID 的年份,也不是主题:

ID     Type   Year  Score   average_past_3_years
10010   Red   2018  14.0    NaN
10010   Blue  2018  24.0    NaN
10010   Red   2019  16.0    NaN
10010   Blue  2019  5.0     18.000000
10010   Red   2021  87.0    NaN
10010   Blue  2021  33.0    36.000000
10013   Red   2018  11.0    NaN
10013   Blue  2018  13.0    NaN
10013   Red   2019  3.0     15.000000
10013   Blue  2019  16.0    9.000000
10013   Red   2021  37.0    41.666667
10013   Blue  2021  49.0    18.666667

我试图输出红色的 10013,2021 年将是 17,不是 41,因为它只会计算红色分数。

我尝试过:

df['average_past_3_years'] = df.groupby([['ID', 'Type']], as_index = False).rolling(3).agg(
                      {'Score':'mean', 'Year': 'max'}).reset_index(level=0).groupby(
                      'Year').transform('shift')['Score']

但遇到了这个错误:

Grouper and axis must be same length

并且有点卡在那里

我也不确定是否需要事先排序

  [1]: https://stackoverflow.com/questions/61837481/pandas-how-to-calculate-mean-values-of-the-past-n-years-for-every-month

First off I would like to say I checked out [this answer][1] and I couldn't progress with the information there.

So I have a dataset as such

df = pd.DataFrame({'ID': [10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013],
'Type': ['Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue'],
'Year': [2018,2018,2019,2019,2020,2020,2021,2021,2022,2022,2018,2018,2019,2019,2021,2021,2018,2018,2019,2019,2020,2020,2021,2021,2022,2022,2018,2018,2019,2019,2021,2021],
'Score': [0,0,0,0,0,0,0,0,0,0,14,24,16,5,87,33,0,0,0,0,0,0,0,0,0,0,11,13,3,16,37,49]})

I don't know how to type in NaN so I make it like this:

df.replace(0, np.nan, inplace=True)
df = df.dropna(axis=0, subset=['Score'])

I want to take the rolling N (In this case 3) Year average of score by ID and Subject.

I can get the 3 year rolling average score by ID:

df['average_past_3_years'] = df.groupby(['ID'], as_index = False).rolling(3).agg(
{'Score':'mean', 'Year': 'max'}).reset_index(level=0).groupby(
'Year').transform('shift')['Score']
df = df.sort_values(['ID', 'Year'])

That gives me the rolling average but only by year on ID, not subject as well:

ID     Type   Year  Score   average_past_3_years
10010   Red   2018  14.0    NaN
10010   Blue  2018  24.0    NaN
10010   Red   2019  16.0    NaN
10010   Blue  2019  5.0     18.000000
10010   Red   2021  87.0    NaN
10010   Blue  2021  33.0    36.000000
10013   Red   2018  11.0    NaN
10013   Blue  2018  13.0    NaN
10013   Red   2019  3.0     15.000000
10013   Blue  2019  16.0    9.000000
10013   Red   2021  37.0    41.666667
10013   Blue  2021  49.0    18.666667

I am trying to out put say 10013 for red the 2021 would be 17, not 41 as it would only be calculating the red scores.

I tried:

df['average_past_3_years'] = df.groupby([['ID', 'Type']], as_index = False).rolling(3).agg(
                      {'Score':'mean', 'Year': 'max'}).reset_index(level=0).groupby(
                      'Year').transform('shift')['Score']

but got this error:

Grouper and axis must be same length

And kinda got stuck there

I'm also unsure if I need to sort before hand

  [1]: https://stackoverflow.com/questions/61837481/pandas-how-to-calculate-mean-values-of-the-past-n-years-for-every-month

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

姜生凉生 2025-01-21 19:11:00

IIUC 用途:

df1 = df.groupby(['ID','Type']).rolling(3).agg( {'Score':'mean', 'Year': 'max'}).droplevel([0,1])

df = df.join(df1.add_suffix('_r'))
df['Score_r'] = df.groupby('Year')['Score_r'].shift()

IIUC use:

df1 = df.groupby(['ID','Type']).rolling(3).agg( {'Score':'mean', 'Year': 'max'}).droplevel([0,1])

df = df.join(df1.add_suffix('_r'))
df['Score_r'] = df.groupby('Year')['Score_r'].shift()
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文