Python VIF returns infinite values for dummy variables
In the stroke prediction dataset, I've created dummy variables for all the categorical variables, e.g. gender_male and gender_female, smoking_status_smokes and smoking_status_unknown, and so on. To check for multicollinearity across all the variables (numerical and dummy), I've computed the variance inflation factor (VIF) for each one:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Predictor matrix: every column except the target 'stroke'
X = new_df.drop(columns='stroke')
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
# One VIF per predictor column
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_data
The output that I get is below:
    feature                         VIF
0   age                             2.836394
1   hypertension                    1.111484
2   heart_disease                   1.113943
3   avg_glucose_level               1.107552
4   bmi                             1.342729
5   gender_Female                   inf
6   gender_Male                     inf
7   ever_married_No                 inf
8   ever_married_Yes                inf
9   work_type_Govt_job              inf
10  work_type_Never_worked          inf
11  work_type_Private               inf
12  work_type_Self-employed         inf
13  work_type_children              inf
14  Residence_type_Rural            inf
15  Residence_type_Urban            inf
16  smoking_status_formerly smoked  inf
17  smoking_status_never smoked     inf
18  smoking_status_smokes           inf
Can somebody please explain why the VIFs of the dummy variables are infinite? Is there a better way to check for multicollinearity? Thanks.
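For anyone trying to reproduce this, here is a minimal sketch on made-up data. It assumes the dummies were built with pd.get_dummies and that every level of each category was kept, which the output above suggests but the post does not state. The toy frame and the rank_report helper are hypothetical and are not part of the original code. The VIF of a column is 1 / (1 - R²), where R² comes from regressing that column on all the others, so it diverges whenever a column is an exact linear combination of the rest. The sketch checks for that condition directly by comparing the matrix rank with the number of columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names mirror the question, values are made up.
toy = pd.DataFrame({
    "age": [67, 61, 80, 49, 79, 81, 74, 69],
    "gender": ["Male", "Female", "Male", "Female",
               "Female", "Male", "Male", "Female"],
    "ever_married": ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No"],
})
cat_cols = ["gender", "ever_married"]


def rank_report(frame):
    """Return (number of columns, matrix rank) for a numeric DataFrame."""
    X = frame.to_numpy(dtype=float)
    return X.shape[1], int(np.linalg.matrix_rank(X))


# Keep every level: each group of dummies sums to 1 in every row.
full = pd.get_dummies(toy, columns=cat_cols, dtype=float)
gender_sum = full[["gender_Female", "gender_Male"]].sum(axis=1)
married_sum = full[["ever_married_No", "ever_married_Yes"]].sum(axis=1)

# Drop one level per category, which removes the exact linear dependence.
reduced = pd.get_dummies(toy, columns=cat_cols, drop_first=True, dtype=float)

print("full:   ", rank_report(full))     # rank is lower than the column count
print("reduced:", rank_report(reduced))  # rank equals the column count
```

In the full encoding gender_Female + gender_Male equals ever_married_No + ever_married_Yes in every row, so each dummy can be predicted exactly from the others. That is the situation in which R² is 1 and the VIF comes out as inf. One caveat, as far as I know: variance_inflation_factor does not add an intercept on its own, so after dropping a level you may want to append a constant column (for example with statsmodels' add_constant) before computing VIFs.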