Pandas: find and replace outliers based on groups from two columns
I'm having a bit of trouble finding outliers in a df based on groups and dates.
For example, I have a df like the one below, and I would like to find the outlier values (10 for group A on 2022-06-27 and 20 for group B on 2022-06-27) and replace them with the median of the respective group on that date (3 for the first outlier and 4 for the second).
However, I'm having some trouble filtering the data, isolating the outliers, and replacing them.
import pandas as pd

index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
s = pd.Series(['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'], index=index)
t = pd.Series(['2022-06-28', '2022-06-28', '2022-06-28', '2022-06-27', '2022-06-27', '2022-06-27',
               '2022-06-28', '2022-06-28', '2022-06-28', '2022-06-27', '2022-06-27', '2022-06-27'], index=index)
r = pd.Series([1, 2, 1, 2, 3, 10, 2, 3, 2, 3, 4, 20], index=index)
df = pd.DataFrame(s, columns=['group'])
df['date'] = t
df['value'] = r  # column named 'value' so it matches the printed frame and the answers
print(df)

    group        date  value
0       A  2022-06-28      1
1       A  2022-06-28      2
2       A  2022-06-28      1
3       A  2022-06-27      2
4       A  2022-06-27      3
5       A  2022-06-27     10
6       B  2022-06-28      2
7       B  2022-06-28      3
8       B  2022-06-28      2
9       B  2022-06-27      3
10      B  2022-06-27      4
11      B  2022-06-27     20
Thanks for the help!
Comments (2)
First, you can identify outliers. This code flags any value that lies more than one standard deviation away from the mean of the whole value column.
outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
Then you can determine the median of each group:
medians = df.groupby('group')['value'].median()
Finally, locate the outliers and replace with the medians:
df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].to_list()
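The three steps above can be combined into one runnable sketch. The frame is rebuilt inline so the block is self-contained, and the cast to float is an added assumption so that fractional medians would fit in the column. Note that this approach takes the median per group only, not per group and date, so on this data the outliers are replaced with 2 and 3 rather than the 3 and 4 the question asks for.

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    "group": list("AAAAAABBBBBB"),
    "date": (["2022-06-28"] * 3 + ["2022-06-27"] * 3) * 2,
    "value": [1, 2, 1, 2, 3, 10, 2, 3, 2, 3, 4, 20],
})
df["value"] = df["value"].astype(float)  # assumption: medians may be fractional

# Index labels of rows more than one standard deviation from the overall mean
outliers = df.loc[(df["value"] - df["value"].mean()).abs() > df["value"].std()].index

# Median of each group (ignores the date column)
medians = df.groupby("group")["value"].median()

# Look up each outlier's group median and write it back in place
df.loc[outliers, "value"] = medians.loc[df.loc[outliers, "group"]].to_list()

print(df)
```

To get medians per group and date instead, one option would be to group on both columns, for example with groupby(['group', 'date']).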
You can use a combination of .groupby/transform to obtain the medians for each grouping, and then mask your original data against the outliers, filling with those medians.
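A minimal sketch of that idea follows. The outlier rule used here is an assumption, since this answer does not state one: it borrows the rule from the other answer (more than one standard deviation from the overall mean). The grouping on both group and date is likewise a choice made to match the result the question asks for.

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    "group": list("AAAAAABBBBBB"),
    "date": (["2022-06-28"] * 3 + ["2022-06-27"] * 3) * 2,
    "value": [1, 2, 1, 2, 3, 10, 2, 3, 2, 3, 4, 20],
})

# Median of each (group, date) bucket, broadcast back to one value per row
medians = df.groupby(["group", "date"])["value"].transform("median")

# Assumed outlier rule: more than one standard deviation from the overall mean
is_outlier = (df["value"] - df["value"].mean()).abs() > df["value"].std()

# Where is_outlier is True, take the bucket median; otherwise keep the value
df["value"] = df["value"].mask(is_outlier, medians)

print(df)
```

Because transform returns a Series aligned with the original index, no lookup by group label is needed; mask simply picks the median on the flagged rows. On this data the 10 becomes 3 and the 20 becomes 4.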