Pandas: find and replace outliers based on groups from two columns

Posted on 2025-02-11 06:46:10

I'm having a bit of trouble finding outliers in a DataFrame based on groups and dates.

For example, I have a df like the one below, and I would like to find the outlier values (10 for group A on 2022-06-27 and 20 for group B on 2022-06-27) and replace them with the median of the respective group (3 for the first outlier and 4 for the second).

However, I'm having some trouble filtering the data, isolating the outliers, and replacing them.

import pandas as pd

index = [0,1,2,3,4,5,6,7,8,9,10,11]
s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'],index= index)
t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
               '2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'],index= index)
r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20],index= index)
df = pd.DataFrame(s,columns = ['group'])
df['date'] = t
df['value'] = r

print (df)


   group        date  value
0      A  2022-06-28      1
1      A  2022-06-28      2
2      A  2022-06-28      1
3      A  2022-06-27      2
4      A  2022-06-27      3
5      A  2022-06-27     10
6      B  2022-06-28      2
7      B  2022-06-28      3
8      B  2022-06-28      2
9      B  2022-06-27      3
10     B  2022-06-27      4
11     B  2022-06-27     20

Thanks for the help!

七禾 2025-02-18 06:46:10

First you can identify the outliers. This code flags any value that lies more than one standard deviation away from the mean of the whole column.

outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index

Then you can determine the median of each group:

medians = df.groupby('group')['value'].median()

Finally, locate the outliers and replace with the medians:

df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].to_list()

All together it looks like:

import pandas as pd
index = [0,1,2,3,4,5,6,7,8,9,10,11]
s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'],index= index)
t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
               '2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'],index= index)
r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20],index= index)
df = pd.DataFrame(s,columns = ['group'])
df['date'] = t
df['value'] = r
outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
medians = df.groupby('group')['value'].median()
df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].values

Output:

   group        date  value
0      A  2022-06-28      1
1      A  2022-06-28      2
2      A  2022-06-28      1
3      A  2022-06-27      2
4      A  2022-06-27      3
5      A  2022-06-27      2
6      B  2022-06-28      2
7      B  2022-06-28      3
8      B  2022-06-28      2
9      B  2022-06-27      3
10     B  2022-06-27      4
11     B  2022-06-27      3
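
Note that the medians above are taken per group only, so the two outliers become 2 and 3 rather than the 3 and 4 asked for in the question. The following is a minimal sketch of one way to get the question's values, assuming the medians should be keyed on both group and date. The names is_outlier and group_date_median are illustrative and do not come from the answer.

```python
import pandas as pd

# Same data as in the question, built in one step.
df = pd.DataFrame({
    'group': ['A'] * 6 + ['B'] * 6,
    'date': (['2022-06-28'] * 3 + ['2022-06-27'] * 3) * 2,
    'value': [1, 2, 1, 2, 3, 10, 2, 3, 2, 3, 4, 20],
})

# Same global rule as above: more than one standard deviation from the mean.
is_outlier = (df['value'] - df['value'].mean()).abs() > df['value'].std()

# Median of each (group, date) pair, aligned row by row with df.
# Assumption: the outlier itself is still included when taking the median.
group_date_median = df.groupby(['group', 'date'])['value'].transform('median')

# Keep ordinary values and swap outliers for their (group, date) median.
df['value'] = df['value'].where(~is_outlier, group_date_median)
print(df)
```

With this data, rows 5 and 11 should end up as 3 and 4. The column may come back as float, because the medians are floats.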

左耳近心 2025-02-18 06:46:10

You can use a combination of .groupby/transform to obtain the medians for each grouping, and then mask your original data against the outliers, filling with those medians.

medians = df.groupby('group')['value'].transform('median')
df['new_value'] = df['value'].mask(lambda s: (s - s.mean()).abs() > s.std(), medians)

print(df)
   group        date  value  new_value
0      A  2022-06-28      1        1.0
1      A  2022-06-28      2        2.0
2      A  2022-06-28      1        1.0
3      A  2022-06-27      2        2.0
4      A  2022-06-27      3        3.0
5      A  2022-06-27     10        2.0
6      B  2022-06-28      2        2.0
7      B  2022-06-28      3        3.0
8      B  2022-06-28      2        2.0
9      B  2022-06-27      3        3.0
10     B  2022-06-27      4        4.0
11     B  2022-06-27     20        3.0
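
The mask above still uses the mean and standard deviation of the whole column. If the groups sit on very different scales, it may be preferable to compute the threshold inside each group as well. The following is a rough sketch of that variant, assuming the one-standard-deviation rule is kept. The names group_mean, group_std, group_median and is_outlier are illustrative.

```python
import pandas as pd

# Same data as in the question, built in one step.
df = pd.DataFrame({
    'group': ['A'] * 6 + ['B'] * 6,
    'date': (['2022-06-28'] * 3 + ['2022-06-27'] * 3) * 2,
    'value': [1, 2, 1, 2, 3, 10, 2, 3, 2, 3, 4, 20],
})

grouped = df.groupby('group')['value']

# Per-group statistics, broadcast back to the shape of df.
group_mean = grouped.transform('mean')
group_std = grouped.transform('std')
group_median = grouped.transform('median')

# A value is flagged when it is more than one group-level standard
# deviation away from its own group's mean.
is_outlier = (df['value'] - group_mean).abs() > group_std

# Replace flagged values with the median of their group.
df['new_value'] = df['value'].mask(is_outlier, group_median)
print(df)
```

On this small example the per-group rule flags the same two rows (5 and 11) as the global rule, so new_value matches the output above. On other data the two rules can disagree, especially for small groups where a single value shifts the standard deviation a lot.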
