After applying OneHotEncoder I get NaN values in my DataFrame. Any idea how to get rid of them?

Posted on 2025-01-11 15:04:12

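I am working on a clustering project with a dataset that contains no null values. It has two categorical columns, gender and veh_body (vehicle type), which I need to convert to numeric values, so I applied a one-hot encoder to the dataset. The output, however, contains NaN values.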

Any idea what I am doing wrong, or is there a way to work around this?

Dataset details:

Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   gender         52989 non-null  object        
 1   agecat         52989 non-null  float64       
 2   date_of_birth  52989 non-null  datetime64[ns]
 3   credit_score   52989 non-null  float64       
 4   veh_age        52989 non-null  int64         
 5   veh_body       52989 non-null  object        
 6   veh_value      52989 non-null  float64       
 7   numclaims      52989 non-null  int64         
 8   claimcst0      52989 non-null  float64       
 9   year           52989 non-null  int64         
dtypes: datetime64[ns](1), float64(4), int64(3), object(2)
memory usage: 4.4+ MB

One-hot encoder code:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# create an instance of the one-hot encoder
encoder = OneHotEncoder(handle_unknown='ignore')
enc_df = pd.DataFrame(
    encoder.fit_transform(df_updated_new1[['gender', 'veh_body']]).toarray()
)
# note: get_feature_names was removed in scikit-learn 1.2;
# newer versions use get_feature_names_out instead
enc_df.columns = encoder.get_feature_names(['gender', 'veh_body'])
enc_df

# merge with main df df_updated_new1 on key values
df_updated_enc = df_updated_new1.join(enc_df)
df_updated_enc

# drop the original categorical columns
df_updated_enc = df_updated_enc.drop(['gender', 'veh_body'], axis=1)
df_updated_enc.tail()

The output contains null values:

Data columns (total 22 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   agecat          52989 non-null  float64
 1   credit_score    52989 non-null  float64
 2   veh_age         52989 non-null  int64  
 3   veh_value       52989 non-null  float64
 4   numclaims       52989 non-null  int64  
 5   claimcst0       52989 non-null  float64
 6   year            52989 non-null  int64  
 7   gender_F        46510 non-null  float64
 8   gender_M        46510 non-null  float64
 9   veh_body_BUS    46510 non-null  float64
 10  veh_body_CONVT  46510 non-null  float64
 11  veh_body_COUPE  46510 non-null  float64
 12  veh_body_HBACK  46510 non-null  float64
 13  veh_body_HDTOP  46510 non-null  float64
 14  veh_body_MCARA  46510 non-null  float64
 15  veh_body_MIBUS  46510 non-null  float64
 16  veh_body_PANVN  46510 non-null  float64
 17  veh_body_RDSTR  46510 non-null  float64
 18  veh_body_SEDAN  46510 non-null  float64
 19  veh_body_STNWG  46510 non-null  float64
 20  veh_body_TRUCK  46510 non-null  float64
 21  veh_body_UTE    46510 non-null  float64
dtypes: float64(19), int64(3)

