Pandas: winsorize feature outliers within each group
I have a dataframe with 100 features, and I want to winsorize the outliers within each 'group'.
You can use the following code to generate the dataframe.
import numpy as np
import pandas as pd
from scipy.stats import mstats

# 500 rows, feature columns f_0 ... f_100, plus a 'group' label in {1, 2, 3}
data = np.random.randint(1, 999, size=(500, 101))
cols = []
for i in range(101):
    cols += [f'f_{i}']
df = pd.DataFrame(data, columns=cols)
df['group'] = np.random.randint(1, 4, size=(500, 1))
df = df.sort_values(by=['group'])
Now I want to winsorize (NOT delete!) the extreme values for each feature in each group.
If you are not sure what 'winsorize' means, here is an example:
Before winsorizing:
1, 2, 3, 4, 5 ... 97, 98, 99, 100
After winsorizing the smallest and largest 1%:
2, 2, 3, 4, 5 ... 97, 98, 99, 99
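For reference, this behaviour is easy to verify with a quick, illustrative check using scipy's mstats.winsorize (this snippet is separate from the dataframe above):

import numpy as np
from scipy.stats import mstats

x = np.arange(1, 101)                         # 1, 2, ..., 100
w = mstats.winsorize(x, limits=[0.01, 0.01])  # clip the lowest and highest 1%
print(w.min(), w.max())                       # expected output: 2 99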
I know how to winsorize the extreme 1% of values for every feature across the entire dataframe with the following code.
for col in df.columns:   # note: this also touches the 'group' column
    df[col] = mstats.winsorize(df[col], limits=[0.01, 0.01])
However, I want to winsorize each feature within each group.
Can anyone please advise?
Thank you!
Comments (1)
There must be a more elegant way than this, but it seems to work for me and it's just a tiny addition to your solution:
As you can see, I also iterate over the groups in addition to the columns, and solve the problem by filtering each column to the current group.
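The answer's original code is not shown above; a minimal sketch of what it describes (iterating over the groups as well as the columns, and winsorizing the rows filtered to each group), assuming the dataframe and the 1% limits from the question, might look like this:

import numpy as np
from scipy.stats import mstats

# 'df' is the dataframe generated in the question.
feature_cols = [c for c in df.columns if c != 'group']   # skip the group label itself

for g in df['group'].unique():
    mask = df['group'] == g                               # rows belonging to this group
    for col in feature_cols:
        clipped = mstats.winsorize(df.loc[mask, col], limits=[0.01, 0.01])
        df.loc[mask, col] = np.asarray(clipped)           # winsorize returns a masked array

A df.groupby('group')[feature_cols].transform(...) call with the same mstats.winsorize lambda should be a more compact alternative, but the explicit loop stays closest to the whole-dataframe solution in the question.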