After applying OneHotEncoder, I get NaN values in my DataFrame. Any idea how to get rid of them?

Posted on 2025-01-11 15:04:12

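I am working on a clustering project and was given a dataset that contains no null values. It has two categorical columns, gender and vehicle type (gender and veh_body), which I need to convert to numeric values, so I applied a one-hot encoder to the dataset, but the output contains NaN values.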

Any idea what I am doing wrong, or is there a way to work around this?

Dataset details:

Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   gender         52989 non-null  object        
 1   agecat         52989 non-null  float64       
 2   date_of_birth  52989 non-null  datetime64[ns]
 3   credit_score   52989 non-null  float64       
 4   veh_age        52989 non-null  int64         
 5   veh_body       52989 non-null  object        
 6   veh_value      52989 non-null  float64       
 7   numclaims      52989 non-null  int64         
 8   claimcst0      52989 non-null  float64       
 9   year           52989 non-null  int64         
dtypes: datetime64[ns](1), float64(4), int64(3), object(2)
memory usage: 4.4+ MB

One-hot encoder code:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# create an instance of the one-hot encoder
encoder = OneHotEncoder(handle_unknown='ignore')
# fit on the two categorical columns and convert the sparse result to a dense array
enc_df = pd.DataFrame(
    encoder.fit_transform(df_updated_new1[['gender', 'veh_body']]).toarray()
)
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it
enc_df.columns = encoder.get_feature_names_out(['gender', 'veh_body'])
enc_df

# join with the main DataFrame df_updated_new1 (join aligns rows on the index)
df_updated_enc = df_updated_new1.join(enc_df)
df_updated_enc

# drop the original categorical columns
df_updated_enc = df_updated_enc.drop(['gender', 'veh_body'], axis=1)
df_updated_enc.tail()

The output contains null values:

Data columns (total 22 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   agecat          52989 non-null  float64
 1   credit_score    52989 non-null  float64
 2   veh_age         52989 non-null  int64  
 3   veh_value       52989 non-null  float64
 4   numclaims       52989 non-null  int64  
 5   claimcst0       52989 non-null  float64
 6   year            52989 non-null  int64  
 7   gender_F        46510 non-null  float64
 8   gender_M        46510 non-null  float64
 9   veh_body_BUS    46510 non-null  float64
 10  veh_body_CONVT  46510 non-null  float64
 11  veh_body_COUPE  46510 non-null  float64
 12  veh_body_HBACK  46510 non-null  float64
 13  veh_body_HDTOP  46510 non-null  float64
 14  veh_body_MCARA  46510 non-null  float64
 15  veh_body_MIBUS  46510 non-null  float64
 16  veh_body_PANVN  46510 non-null  float64
 17  veh_body_RDSTR  46510 non-null  float64
 18  veh_body_SEDAN  46510 non-null  float64
 19  veh_body_STNWG  46510 non-null  float64
 20  veh_body_TRUCK  46510 non-null  float64
 21  veh_body_UTE    46510 non-null  float64
dtypes: float64(19), int64(3)


Comments (1)

飘逸的'云 2025-01-18 15:04:12

I have reset the index and now everything is fine.
