After applying OneHotEncoding, I get NaN values in the data frame. Any idea how to get rid of them?
I am working on a clustering project and was given a dataset with no null values in it.
There are 2 categorical columns, gender and vehicle type (gender and veh_body), which I need to convert to numeric values, so I applied a one-hot encoder to the dataset, but the output has NaN values in it.
Any idea what I am doing wrong, or is there a way to overcome this?
Dataset details:
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 52989 non-null object
1 agecat 52989 non-null float64
2 date_of_birth 52989 non-null datetime64[ns]
3 credit_score 52989 non-null float64
4 veh_age 52989 non-null int64
5 veh_body 52989 non-null object
6 veh_value 52989 non-null float64
7 numclaims 52989 non-null int64
8 claimcst0 52989 non-null float64
9 year 52989 non-null int64
dtypes: datetime64[ns](1), float64(4), int64(3), object(2)
memory usage: 4.4+ MB
One-hot encoder code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# creating instance of one-hot-encoder
encoder = OneHotEncoder(handle_unknown='ignore')
enc_df = pd.DataFrame(encoder.fit_transform(df_updated_new1[['gender', 'veh_body']]).toarray())
enc_df.columns = encoder.get_feature_names(['gender', 'veh_body'])
enc_df
# merge with main df df_updated_new1 (join() matches rows on the index)
df_updated_enc = df_updated_new1.join(enc_df)
df_updated_enc
# drop the original categorical columns
df_updated_enc = df_updated_enc.drop(['gender', 'veh_body'], axis=1)
df_updated_enc.tail()
The output has null values in it:
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 agecat 52989 non-null float64
1 credit_score 52989 non-null float64
2 veh_age 52989 non-null int64
3 veh_value 52989 non-null float64
4 numclaims 52989 non-null int64
5 claimcst0 52989 non-null float64
6 year 52989 non-null int64
7 gender_F 46510 non-null float64
8 gender_M 46510 non-null float64
9 veh_body_BUS 46510 non-null float64
10 veh_body_CONVT 46510 non-null float64
11 veh_body_COUPE 46510 non-null float64
12 veh_body_HBACK 46510 non-null float64
13 veh_body_HDTOP 46510 non-null float64
14 veh_body_MCARA 46510 non-null float64
15 veh_body_MIBUS 46510 non-null float64
16 veh_body_PANVN 46510 non-null float64
17 veh_body_RDSTR 46510 non-null float64
18 veh_body_SEDAN 46510 non-null float64
19 veh_body_STNWG 46510 non-null float64
20 veh_body_TRUCK 46510 non-null float64
21 veh_body_UTE 46510 non-null float64
dtypes: float64(19), int64(3)
Comments (1)
I have reset the index and now everything is fine.
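For anyone hitting the same thing, here is a minimal sketch of why resetting the index probably helps. It assumes the main frame's index has gaps (for example after rows were filtered out), which the question does not state explicitly. The toy data and all variable names are made up, and `pd.get_dummies(...).to_numpy()` is used only as a stand-in for the dense array returned by `encoder.fit_transform(...).toarray()`, so the snippet runs with pandas alone.

```python
import pandas as pd

# Toy frame whose index has gaps, as happens after filtering rows (hypothetical data).
df = pd.DataFrame(
    {"gender": ["F", "M", "F", "M"], "veh_body": ["SEDAN", "UTE", "BUS", "SEDAN"]},
    index=[0, 2, 5, 7],
)

# Stand-in for encoder.fit_transform(...).toarray(): a plain array that carries no index.
dummies = pd.get_dummies(df[["gender", "veh_body"]])
encoded = dummies.to_numpy(dtype=float)
names = list(dummies.columns)

# Reproduce the problem: the new frame gets a default 0..n-1 index,
# so join() only finds matches for the labels both frames share.
enc_bad = pd.DataFrame(encoded, columns=names)
joined_bad = df.join(enc_bad)
nan_rows_bad = int(joined_bad["gender_F"].isna().sum())

# Fix 1: reset the index of the main frame before joining.
joined_reset = df.reset_index(drop=True).join(enc_bad)

# Fix 2: keep the original index by passing it to the encoded frame.
enc_good = pd.DataFrame(encoded, columns=names, index=df.index)
joined_keep = df.join(enc_good)

print(nan_rows_bad)
print(int(joined_reset.isna().sum().sum()), int(joined_keep.isna().sum().sum()))
```

Note that in the broken join, rows whose labels happen to match can also pick up the encoded values of a different row, so it is worth fixing the alignment rather than just dropping the NaN rows. Fix 2 keeps the original row labels, which may be preferable if they are needed later.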