Pandas: find and replace outliers based on groups from two columns

Posted 2025-02-11 06:46:10 · Words: 1005 · Views: 0 · Comments: 0

I'm having a bit of trouble finding outliers in a df based on groups and dates.

For example, I have a df like the one below, and I would like to find the outlier values (10 for group A on 2022-06-27 and 20 for group B on 2022-06-27) and replace them with the median of the respective group (3 for the first outlier and 4 for the second).

However, I'm having some trouble filtering the data, isolating the outliers, and replacing them.

import pandas as pd

index = [0,1,2,3,4,5,6,7,8,9,10,11]
s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'],index= index)
t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
               '2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'],index= index)
r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20],index= index)
df = pd.DataFrame(s,columns = ['group'])
df['date'] = t
df['value'] = r

print(df)

   group        date  value
0      A  2022-06-28      1
1      A  2022-06-28      2
2      A  2022-06-28      1
3      A  2022-06-27      2
4      A  2022-06-27      3
5      A  2022-06-27     10
6      B  2022-06-28      2
7      B  2022-06-28      3
8      B  2022-06-28      2
9      B  2022-06-27      3
10     B  2022-06-27      4
11     B  2022-06-27     20

Thanks for the help!

Comments (2)

七禾 2025-02-18 06:46:10

First you can identify the outliers. This code flags any value that is more than one standard deviation away from the mean of the whole value column.

outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index

Then you can determine the median of each group:

medians = df.groupby('group')['value'].median()

Finally, locate the outliers and replace with the medians:

df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].to_list()

All together it looks like:

import pandas as pd
index = [0,1,2,3,4,5,6,7,8,9,10,11]
s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'],index= index)
t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
               '2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'],index= index)
r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20],index= index)
df = pd.DataFrame(s,columns = ['group'])
df['date'] = t
df['value'] = r
outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
medians = df.groupby('group')['value'].median()
df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].values

Output:

   group        date  value
0      A  2022-06-28      1
1      A  2022-06-28      2
2      A  2022-06-28      1
3      A  2022-06-27      2
4      A  2022-06-27      3
5      A  2022-06-27      2
6      B  2022-06-28      2
7      B  2022-06-28      3
8      B  2022-06-28      2
9      B  2022-06-27      3
10     B  2022-06-27      4
11     B  2022-06-27      3
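
A possible variation, not part of the original answer: the output above replaces the outliers with 2 and 3 because the medians are taken per group only, whereas the question asks for 3 and 4, which are the medians of each (group, date) combination. The sketch below keeps the same outlier rule and assumes the replacement should come from grouping on both columns; the names is_outlier and new_value are illustrative.

```python
import pandas as pd

# Same sample data as in the question, built in one step.
df = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 6,
    "date": (["2022-06-28"] * 3 + ["2022-06-27"] * 3) * 2,
    "value": [1, 2, 1, 2, 3, 10, 2, 3, 2, 3, 4, 20],
})

# Same rule as above: more than one standard deviation from the column-wide mean.
is_outlier = (df["value"] - df["value"].mean()).abs() > df["value"].std()

# Median of each (group, date) combination, aligned row by row with df.
medians = df.groupby(["group", "date"])["value"].transform("median")

# Replace only the flagged rows; the new column is float because the medians are float.
df["new_value"] = df["value"].mask(is_outlier, medians)

print(df.loc[is_outlier, ["group", "date", "value", "new_value"]])
```

On the sample data this flags rows 5 and 11 and replaces them with 3.0 and 4.0 respectively.
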
左耳近心 2025-02-18 06:46:10

You can use a combination of .groupby/transform to obtain the medians for each grouping, and then mask your original data against the outliers, filling with those medians.

medians = df.groupby('group')['value'].transform('median')
df['new_value'] = df['value'].mask(lambda s: (s - s.mean()).abs() > s.std(), medians)

print(df)
   group        date  value  new_value
0      A  2022-06-28      1        1.0
1      A  2022-06-28      2        2.0
2      A  2022-06-28      1        1.0
3      A  2022-06-27      2        2.0
4      A  2022-06-27      3        3.0
5      A  2022-06-27     10        2.0
6      B  2022-06-28      2        2.0
7      B  2022-06-28      3        3.0
8      B  2022-06-28      2        2.0
9      B  2022-06-27      3        3.0
10     B  2022-06-27      4        4.0
11     B  2022-06-27     20        3.0
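
A possible variation, not part of the original answer: the lambda above computes the mean and standard deviation over the whole column, so the groups only influence the replacement value. If the outlier test itself should be made within each group, the same .groupby/transform idea can supply a per-group mean and standard deviation. This is a minimal sketch under that assumption; the variable names are illustrative.

```python
import pandas as pd

# Same sample data as in the question, built in one step.
df = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 6,
    "date": (["2022-06-28"] * 3 + ["2022-06-27"] * 3) * 2,
    "value": [1, 2, 1, 2, 3, 10, 2, 3, 2, 3, 4, 20],
})

grouped = df.groupby("group")["value"]

# Per-group statistics, broadcast back to the shape of the original column.
group_mean = grouped.transform("mean")
group_std = grouped.transform("std")
group_median = grouped.transform("median")

# A value is flagged when it lies more than one group standard deviation
# away from its own group's mean.
is_outlier = (df["value"] - group_mean).abs() > group_std

# Fill the flagged rows with the median of their group.
df["new_value"] = df["value"].mask(is_outlier, group_median)

print(df)
```

On this sample the per-group test flags the same two rows (5 and 11) as the column-wide test. With very small groups a one-standard-deviation threshold can also flag ordinary values, so the threshold may need adjusting for other data.
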