Pandas: Winsorize feature outliers within each group

Posted 2025-01-11 14:18:39

I have a DataFrame with 100 features, and I want to winsorize the outliers within each 'group'.
You can use the following code to generate the DataFrame.

import numpy as np
import pandas as pd
from scipy.stats import mstats

data = np.random.randint(1,999,size=(500,101))

cols = []
for i in range(101):
   cols += [f'f_{i}']  

df = pd.DataFrame(data, columns=cols)
df['group'] = np.random.randint(1,4,size=(500,1))
df = df.sort_values(by=['group'])

Now I want to winsorize (NOT delete!) the extreme values of each feature within each group.

If you are not sure what 'winsorize' means, here is an example:

Before winsorizing:

1, 2, 3, 4, 5 ... 97, 98, 99, 100

After winsorizing the smallest and largest 1%:

2, 2, 3, 4, 5 ... 97, 98, 99, 99
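
For readers who want to check the example, here is a minimal sketch that reproduces it with `scipy.stats.mstats.winsorize`. The variable names `values` and `winsorized` are only illustrative and are not part of the question's code.

```python
import numpy as np
from scipy.stats import mstats

values = np.arange(1, 101)  # 1, 2, ..., 100

# limits=[0.01, 0.01] replaces the lowest 1% and the highest 1% of the values
# with their nearest remaining neighbours; np.asarray unwraps the masked array.
winsorized = np.asarray(mstats.winsorize(values, limits=[0.01, 0.01]))

print(winsorized[:5], winsorized[-4:])  # [2 2 3 4 5] [97 98 99 99]
```

The input array is left unchanged, because `winsorize` works on a copy unless `inplace=True` is passed.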

I know how to winsorize the extreme 1% values of each feature across the entire DataFrame, using the following code.

# Winsorize the lowest and highest 1% of every column over the whole frame
for col in df.columns:
    df[col] = mstats.winsorize(df[col], limits=[0.01, 0.01])

However, I want to winsorize each feature within each group.

Can anyone please advise?
Thank you!

勿忘初心 2025-01-18 14:18:39

There must be a more elegant way than this, but it seems to work for me and it's just a tiny addition to your solution:

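for col in df.columns.drop('group'):
    for group in df['group'].unique():
        mask = df['group'] == group
        # Select the group's rows with a boolean mask and assign through .loc so the write lands in df
        clipped = mstats.winsorize(df.loc[mask, col].to_numpy(), limits=[0.01, 0.01])
        df.loc[mask, col] = np.asarray(clipped)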

As you can see, I also iterate through the groups in addition to the columns, and solve the problem with simple filtering of each column.
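
One possible alternative to the explicit group loop is pandas' `groupby(...).transform`, which applies a function to each group and returns a result aligned with the original index. The sketch below is only an illustration under a few assumptions: it uses a smaller frame with 5 features instead of 100, and the names `winsorize_series`, `feature_cols`, and `winsorized` are hypothetical helpers, not part of the question or the answer above.

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

# Small stand-in for the question's data (assumption: 5 features instead of 100).
rng = np.random.default_rng(0)
feature_cols = [f"f_{i}" for i in range(5)]
df = pd.DataFrame(rng.integers(1, 999, size=(500, 5)), columns=feature_cols)
df["group"] = rng.integers(1, 4, size=500)  # groups 1, 2, 3


def winsorize_series(s, limits=(0.01, 0.01)):
    """Winsorize one Series and keep its index so pandas can realign the rows."""
    clipped = mstats.winsorize(s.to_numpy(), limits=list(limits))
    # np.asarray unwraps the masked array returned by winsorize.
    return pd.Series(np.asarray(clipped), index=s.index)


winsorized = df.copy()
for col in feature_cols:
    # transform calls winsorize_series once per group and stitches the
    # per-group results back together in the original row order.
    winsorized[col] = df.groupby("group")[col].transform(winsorize_series)
```

Because `transform` realigns on the index, the frame does not need to be sorted by `group` first, and the `group` column itself is never modified.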
