Screening for (multi)collinearity in a regression model

Posted 2024-09-06 07:36:14

I hope this one is not going to be an "ask-and-answer" question... here goes:
(Multi)collinearity refers to extremely high correlations between predictors in a regression model. How to cure it... well, sometimes you don't need to "cure" collinearity, since it doesn't affect the regression model itself, only the interpretation of the effects of individual predictors.

One way to spot collinearity is to regress each predictor, in turn, on all the other predictors and determine R²; if it's larger than .9 (or .95), we can consider that predictor redundant (a sketch of this screening loop appears below). This is one "method"... what about other approaches? Some of them are time consuming, like excluding predictors from the model one at a time and watching for changes in the b-coefficients - they should be noticeably different.

Of course, we must always bear in mind the specific context/goal of the analysis... Sometimes the only remedy is to repeat the study, but right now I'm interested in the various ways of screening for redundant predictors when (multi)collinearity occurs in a regression model.
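
A minimal sketch of that R² screening loop (the data frame mydata and its predictor columns are hypothetical):

> X <- mydata[, c("x1", "x2", "x3")]     # hypothetical predictor columns
> r2 <- sapply(names(X), function(v)     # regress each predictor on the rest
+   summary(lm(reformulate(setdiff(names(X), v), response = v),
+              data = X))$r.squared)
> r2[r2 > 0.9]                           # predictors flagged as redundant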

Comments (5)

长梦不多时 2024-09-13 07:36:23

Since there has been no mention of VIF so far, I will add my answer. A Variance Inflation Factor > 10 usually indicates serious redundancy between predictor variables. The VIF is the factor by which the variance of a variable's coefficient is inflated compared to what it would be if that variable were not highly correlated with the other variables.

vif() is available in the car package and is applied to an object of class "lm". It returns the VIF of each of x1, x2, ..., xn in the fitted lm object. It is a good idea to exclude variables with VIF > 10, or to introduce transformations for the variables with VIF > 10.
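
A minimal usage sketch (the model formula and the data frame mydata are hypothetical):

> library(car)                           # provides vif()
> fit <- lm(y ~ x1 + x2 + x3, data = mydata)
> vif(fit)                               # one VIF per predictor; > 10 is suspect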

智商已欠费 2024-09-13 07:36:22

See also Section 9.4 of the book Practical Regression and Anova using R [Faraway 2002].

Collinearity can be detected in several ways:

  1. Examination of the correlation matrix of the predictors will reveal large pairwise collinearities.

  2. A regression of x_i on all other predictors gives R^2_i. Repeat for all predictors. R^2_i close to one indicates a problem — the offending linear combination may be found.

  3. Examine the eigenvalues of t(X) %*% X, where X denotes the model matrix; small eigenvalues indicate a problem. The 2-norm condition number can be shown to be the ratio of the largest to the smallest non-zero singular value of the matrix ($\kappa = \sqrt{\lambda_1/\lambda_p}$, where $\lambda_1$ and $\lambda_p$ are the largest and smallest eigenvalues of t(X) %*% X; see ?kappa); $\kappa \ge 30$ is considered large (see the sketch below).
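
A minimal sketch of item 3, assuming a hypothetical fitted lm object fit:

> X <- model.matrix(fit)                 # fit is a hypothetical fitted lm
> ev <- eigen(crossprod(X))$values       # eigenvalues of t(X) %*% X
> sqrt(max(ev) / min(ev))                # 2-norm condition number, kappa
> kappa(X, exact = TRUE)                 # same value, computed from the SVD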

画尸师 2024-09-13 07:36:20

You might like Vito Ricci's Reference Card "R Functions For Regression Analysis"
http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf

It succinctly lists many useful regression-related functions in R, including diagnostic functions.
In particular, it lists the vif function from the car package, which can assess multicollinearity.
http://en.wikipedia.org/wiki/Variance_inflation_factor

Consideration of multicollinearity often goes hand in hand with issues of assessing variable importance. If this applies to you, perhaps check out the relaimpo package: http://prof.beuth-hochschule.de/groemping/relaimpo/

草莓味的萝莉 2024-09-13 07:36:19

Just to add to what Dirk said about the Condition Number method, a rule of thumb is that values of CN > 30 indicate severe collinearity. Other methods, apart from condition number, include:

1) the determinant of the correlation matrix, which ranges from 0 (perfect collinearity) to 1 (no collinearity). The example below applies det() to cov() of the simulated predictors; since they have roughly unit variance, the covariance matrix behaves like the correlation matrix here.

# using Dirk's example
> det(cov(mm12[,-1]))
[1] 0.8856818
> det(cov(mm123[,-1]))
[1] 8.916092e-09

2) Using the fact that the determinant of a matrix is the product of its eigenvalues => the presence of one or more small eigenvalues indicates collinearity

> eigen(cov(mm12[,-1]))$values
[1] 1.0876357 0.8143184

> eigen(cov(mm123[,-1]))$values
[1] 5.388022e+00 9.862794e-01 1.677819e-09

3) The value of the Variance Inflation Factor (VIF). The VIF for predictor i is 1/(1 - R_i^2), where R_i^2 is the R^2 from a regression of predictor i against the remaining predictors. Collinearity is present when the VIF for at least one independent variable is large. Rule of thumb: VIF > 10 is of concern. For an implementation in R, see the vif() function in the car package (mentioned above); a sketch of the 1/(1 - R_i^2) computation follows below. I would also like to comment that the use of R^2 for determining collinearity should go hand in hand with visual examination of the scatterplots, because a single outlier can "cause" collinearity where it doesn't exist, or can hide collinearity where it does exist.
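
A minimal sketch of that formula, reusing the simulated x1, x2, x3 from Dirk's answer below:

> r2 <- summary(lm(x3 ~ x1 + x2))$r.squared  # R^2 of x3 on the other predictors
> 1/(1 - r2)                                 # VIF for x3: huge, since x3 ~ x1 + 2*x2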

赠意 2024-09-13 07:36:17

The kappa() function can help. Here is a simulated example:

> set.seed(42)
> x1 <- rnorm(100)
> x2 <- rnorm(100)
> x3 <- x1 + 2*x2 + rnorm(100)*0.0001    # so x3 approx a linear comb. of x1+x2
> mm12 <- model.matrix(~ x1 + x2)        # normal model, two indep. regressors
> mm123 <- model.matrix(~ x1 + x2 + x3)  # bad model with near collinearity
> kappa(mm12)                            # a 'low' kappa is good
[1] 1.166029
> kappa(mm123)                           # a 'high' kappa indicates trouble
[1] 121530.7

and we go further by making the third regressor more and more collinear:

> x4 <- x1 + 2*x2 + rnorm(100)*0.000001  # even more collinear
> mm124 <- model.matrix(~ x1 + x2 + x4)
> kappa(mm124)
[1] 13955982
> x5 <- x1 + 2*x2                        # now x5 is linear comb of x1,x2
> mm125 <- model.matrix(~ x1 + x2 + x5)
> kappa(mm125)
[1] 1.067568e+16

This uses an approximation; see help(kappa) for details.
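
By default kappa() returns a fast estimate; for the exact 2-norm condition number (computed from the singular values, at more cost), pass exact = TRUE:

> kappa(mm123, exact = TRUE)             # exact condition number of the bad model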
