Python VIF returns infinite values for dummy variables

Posted on 2025-01-12 13:25:47

In the stroke prediction dataset, I've created dummy variables for all the categorical variables, e.g. gender_male and gender_female, smoking_status_smokes and smoking_status_unknown, and so on. To check for multicollinearity among all the variables (numerical and dummy), I've applied the variance inflation factor (VIF) function from statsmodels:

import pandas as pd  # needed for pd.DataFrame below
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Predictors only: drop the target column 'stroke'
X = new_df.drop(columns='stroke')

# One VIF per predictor column
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_data

The output that I get is below:

    feature                         VIF
0   age                             2.836394
1   hypertension                    1.111484
2   heart_disease                   1.113943
3   avg_glucose_level               1.107552
4   bmi                             1.342729
5   gender_Female                   inf
6   gender_Male                     inf
7   ever_married_No                 inf
8   ever_married_Yes                inf
9   work_type_Govt_job              inf
10  work_type_Never_worked          inf
11  work_type_Private               inf
12  work_type_Self-employed         inf
13  work_type_children              inf
14  Residence_type_Rural            inf
15  Residence_type_Urban            inf
16  smoking_status_formerly smoked  inf
17  smoking_status_never smoked     inf
18  smoking_status_smokes           inf

Can somebody please explain why the VIFs of the dummy variables are infinite? Is there a better way to check for multicollinearity? Thanks.
