KNNImputer is replacing data with nulls
I was working on a project with sensitive data and stumbled upon this "bug" (probably something that went over my head). I recently learned about KNNImputer from sklearn and I love its concept; however, it is replacing data with null values. I'm working on a data-cleaning and modeling project. At the moment I don't have any null data, but I wanted to add code to fill nulls in case there are some in the future. After running the algorithm, though, it replaces good data with NaN values. Am I using it wrong?
Here is the code:
The libraries I use
from sklearn.impute import KNNImputer
import pandas as pd
import numpy as np
Transformed categorical data as dummies
df_cleaned = pd.get_dummies(df_cleaned, columns=["A", "B", "C", "D", "E"])
print("Categorical -> dummies \n", df_cleaned.info(5))
"I replaced the names of the features"
"And didn't show the 33 columns as the remaining columns have 28519 non-null uint8"
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 28519 non-null int64
1 B 28519 non-null int64
2 C 28519 non-null object
3 D 28519 non-null int64
4 E 28519 non-null int64
5 F 28519 non-null object
6 H 28519 non-null int64
7 I 28519 non-null object
8 J 28519 non-null uint8
9 K 28519 non-null uint8
print("looking for nulls (before) \n", df.isnull().sum())
There are no nulls at this moment
looking for nulls (before)
A 0
B 0
C 0
D 0
E 0
F 0
G 0
H 0
I 0
J 0
Something happens here
imputer = KNNImputer(n_neighbors=5)
df_immputed = pd.DataFrame(imputer.fit_transform(df_cleaned.drop("venue", axis=1)), columns=df_cleaned.drop("venue", axis=1).columns)
df_cleaned = pd.concat([df_immputed, df_cleaned["venue"]], axis=1)
print("looking for nulls (after) \n", df.isnull().sum())
Now there are nulls
looking for nulls (after)
A 28
B 28
C 28
D 28
E 28
F 28
G 28
H 28
I 28
J 28
What is happening? Why is it creating nulls?
Edit:
Row affected
The Letter_# columns are dummies
Original row
A B C D E F G H_1 H_2 H_3 H_4 H_5 H_6
151 128 134110.51 681 532 593894.54 151 0 0 1 0 0 0
H_7 H_8 H_9 H_10 H_11 H_12 I_0 I_1 I_2 J_1 J_2 J_3 J_4 J_5
0 0 0 0 0 0 0 0 1 1 0 0 0 0
J_6 K_1 K_1 L_1 L_2 M
0 0 1 1 0 string value I cannot share sorry
Row with nulls after KNNImputer
A B C D E F G H_1 H_2 H_3 H_4 H_5 H_6 H_7 H_8 H_9 H_10
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
H_11 H_12 I_0 I_1 I_2 J_1 J_2 J_3 J_4 J_5 J_6 K_1 K_1 L_1 L_2
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
M
string value I cannot share sorry
Comments (1)
It's probably due to a nonstandard index on your dataframe. Check the shape of the output: if I'm right, you'll have 28 more rows than before.
The problem arises because when you re-dataframe the numpy result of fit_transform, you get a standard index (0...n-1). pd.concat then matches those indices against the original index of the "venue" column, taking an outer join.
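For illustration, here is a small self-contained sketch (with made-up data, not the asker's frame) of how pd.concat behaves when the two indices do not line up:
import pandas as pd

# Hypothetical frames just to illustrate the mismatch: the original has a
# non-standard index, while the re-framed imputer output gets a fresh 0...n-1 index.
original = pd.DataFrame({"venue": ["a", "b", "c"]}, index=[10, 11, 12])
imputed = pd.DataFrame({"A": [1.0, 2.0, 3.0]})  # RangeIndex 0, 1, 2

combined = pd.concat([imputed, original["venue"]], axis=1)
print(combined)
# The outer join keeps both index sets, so "A" is NaN for rows 10-12 and
# "venue" is NaN for rows 0-2 -- the same kind of all-NaN rows seen above.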
You can fix this in a number of ways; maybe the nicest is to assign the correct frame index when re-framing the imputed numpy array:
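A minimal sketch of that fix, reusing the names from the question (df_cleaned and the "venue" column); passing the features' own index to the DataFrame constructor keeps the rows aligned so the concat no longer produces extra NaN rows:
from sklearn.impute import KNNImputer
import pandas as pd

imputer = KNNImputer(n_neighbors=5)
features = df_cleaned.drop("venue", axis=1)

# Rebuild the frame with the original index so pd.concat aligns row-for-row
# instead of taking an outer join over two different index sets.
df_imputed = pd.DataFrame(
    imputer.fit_transform(features),
    columns=features.columns,
    index=features.index,
)
df_cleaned = pd.concat([df_imputed, df_cleaned["venue"]], axis=1)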