Pandas: winsorize feature outliers within each group
I have a dataframe with 100 features, and I want to winsorize the outliers within each 'group'.
You can use the following code to generate the dataframe.
import numpy as np
import pandas as pd
from scipy.stats import mstats

# 500 rows, feature columns f_0 ... f_100, plus a 'group' label in {1, 2, 3}
data = np.random.randint(1, 999, size=(500, 101))
cols = []
for i in range(101):
    cols += [f'f_{i}']
df = pd.DataFrame(data, columns=cols)
df['group'] = np.random.randint(1, 4, size=(500, 1))
df = df.sort_values(by=['group'])
Now I want to winsorize (NOT delete!) the extreme values for each feature in each group.
If you are not sure what 'winsorize' means, here is an example:
Before winsorizing:
1, 2, 3, 4, 5 ... 97, 98, 99, 100
After winsorizing the smallest and largest 1%:
2, 2, 3, 4, 5 ... 97, 98, 99, 99
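For reference, this behaviour is easy to verify with a quick, illustrative check using scipy's mstats.winsorize (this snippet is separate from the dataframe above):

import numpy as np
from scipy.stats import mstats

x = np.arange(1, 101)                         # 1, 2, ..., 100
w = mstats.winsorize(x, limits=[0.01, 0.01])  # clip the lowest and highest 1%
print(w.min(), w.max())                       # expected output: 2 99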
I know how to winsorize the extreme 1% of values for every feature across the entire dataframe with the following code.
for col in df.columns:   # note: this also touches the 'group' column
    df[col] = mstats.winsorize(df[col], limits=[0.01, 0.01])
However, I want to winsorize each feature within each group.
Can anyone please advise?
Thank you!
Comments (1)
There must be a more elegant way than this, but it seems to work for me and it's just a tiny addition to your solution:
As you can see, I also iterate over the groups in addition to the columns, and solve the problem by filtering each column to the current group.
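The answer's original code is not shown above; a minimal sketch of what it describes (iterating over the groups as well as the columns, and winsorizing the rows filtered to each group), assuming the dataframe and the 1% limits from the question, might look like this:

import numpy as np
from scipy.stats import mstats

# 'df' is the dataframe generated in the question.
feature_cols = [c for c in df.columns if c != 'group']   # skip the group label itself

for g in df['group'].unique():
    mask = df['group'] == g                               # rows belonging to this group
    for col in feature_cols:
        clipped = mstats.winsorize(df.loc[mask, col], limits=[0.01, 0.01])
        df.loc[mask, col] = np.asarray(clipped)           # winsorize returns a masked array

A df.groupby('group')[feature_cols].transform(...) call with the same mstats.winsorize lambda should be a more compact alternative, but the explicit loop stays closest to the whole-dataframe solution in the question.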