KNNImputer is replacing data with nulls



I was working on a project with sensitive data and stumbled upon this "bug" (probably something that went over my head). I recently learned about KNNImputer from sklearn and I love its concept. However, it's replacing data with null values. I'm working on a data cleaning and modeling project; at the moment I don't have any null data, but I wanted to add code to fill nulls in case there are any in the future. After running the algorithm, though, it replaces good data with NaN values. Am I using it wrong?
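
(For reference, a minimal sketch of the behavior KNNImputer is expected to show on a toy frame; the column names below are invented for illustration and are not the real data.)

from sklearn.impute import KNNImputer
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, np.nan], "y": [1.0, 2.0, 3.0, 4.0]})
print(KNNImputer(n_neighbors=2).fit_transform(toy))
# The single NaN in "x" is filled with the mean of its two nearest neighbours (2.5);
# no new NaNs are introduced.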

Here is the code:

The libraries I use

from sklearn.impute import KNNImputer
import pandas as pd
import numpy as np

Transformed categorical data as dummies

df_cleaned = pd.get_dummies(df_cleaned, columns=["A", "B", "C", "D", "E"])
    print("Categorical -> dummies \n", df_cleaned.info(5))

"I replaced the names of the features"
"And didn't show the 33 columns as the remaining columns have 28519 non-null uint8"

Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   A                        28519 non-null  int64 
 1   B                        28519 non-null  int64 
 2   C                        28519 non-null  object
 3   D                        28519 non-null  int64 
 4   E                        28519 non-null  int64 
 5   F                        28519 non-null  object
 6   H                        28519 non-null  int64 
 7   I                        28519 non-null  object
 8   J                        28519 non-null  uint8 
 9   K                        28519 non-null  uint8 

print("looking for nulls (before) \n", df.isnull().sum())

There are no nulls at this moment

looking for nulls (before)
A 0
B 0
C 0
D 0
E 0
F 0
G 0
H 0
I 0
J 0

Something happens here

imputer = KNNImputer(n_neighbors=5)
df_immputed = pd.DataFrame(imputer.fit_transform(df_cleaned.drop("venue", axis=1)), columns=df_cleaned.drop("venue", axis=1).columns)
df_cleaned = pd.concat([df_immputed, df_cleaned["venue"]], axis=1)
print("looking for nulls (after) \n", df.isnull().sum())

Now there are nulls:

looking for nulls (after)
A 28
B 28
C 28
D 28
E 28
F 28
G 28
H 28
I 28
J 28

What is happening? Why is it creating nulls?

Edit:

Row affected
The Letter_# columns are dummies.

Original row

A   B   C           D   E   F           G   H_1 H_2 H_3 H_4 H_5 H_6         
151 128 134110.51   681 532 593894.54   151 0   0   1   0   0   0

H_7 H_8 H_9 H_10    H_11    H_12    I_0 I_1 I_2 J_1 J_2 J_3 J_4 J_5         
0   0   0   0       0       0       0   0   1   1   0   0   0   0   
        
J_6 K_1 K_1 L_1 L_2 M
0   0   1   1   0   string value I cannot share sorry

Row with nulls after KNNImputer

A   B   C   D   E   F   G   H_1 H_2 H_3 H_4 H_5 H_6 H_7 H_8 H_9 H_10    
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

H_11    H_12    I_0 I_1 I_2 J_1 J_2 J_3 J_4 J_5 J_6 K_1 K_1 L_1 L_2 
NaN     NaN     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

M
string value I cannot share sorry


Answer (望笑, 2025-02-11 03:56:38):


It's probably due to a nonstandard index of your dataframe. Check the shape of the output: if I'm right, you'll have 28 more rows than before.

The problem arises because when you rebuild a DataFrame from the numpy result of fit_transform, you get the default index (0...n-1). pd.concat then matches those indices against the original index carried by the "venue" column, effectively taking an outer join: rows whose index exists on only one side get padded with NaN.
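
A minimal sketch of that effect (the index values below are made up for illustration): concatenating a default-indexed frame with one that has a nonstandard index produces NaN-padded rows on both sides.

import pandas as pd

imputed = pd.DataFrame({"A": [1.0, 2.0]})                    # default index 0, 1
venue = pd.DataFrame({"venue": ["x", "y"]}, index=[10, 11])  # nonstandard index
print(pd.concat([imputed, venue], axis=1))
#       A venue
# 0   1.0   NaN
# 1   2.0   NaN
# 10  NaN     x
# 11  NaN     y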

You can fix this in a number of ways; maybe the nicest is to assign the correct frame index when re-framing the imputed numpy array:

df_immputed = pd.DataFrame(
    imputer.fit_transform(df_cleaned.drop("venue", axis=1)),
    columns=df_cleaned.drop("venue", axis=1).columns,
    index=df_cleaned.index,
)
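
An alternative, assuming you don't need to preserve the original index, is to reset it before imputing so that both pieces share the default 0...n-1 index:

df_cleaned = df_cleaned.reset_index(drop=True)
df_immputed = pd.DataFrame(
    imputer.fit_transform(df_cleaned.drop("venue", axis=1)),
    columns=df_cleaned.drop("venue", axis=1).columns,
)
df_cleaned = pd.concat([df_immputed, df_cleaned["venue"]], axis=1)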
