Screening for (multi)collinearity in a regression model

Posted 2024-09-06 07:36:14

I hope this one is not going to be an "ask-and-answer" question... here goes:
(Multi)collinearity refers to extremely high correlations between predictors in a regression model. How to cure it... well, sometimes you don't need to "cure" collinearity, since it doesn't affect the regression model itself, only the interpretation of the effects of individual predictors.

One way to spot collinearity is to regress each predictor, in turn, on all the other predictors and determine R²; if it's larger than .9 (or .95), we can consider that predictor redundant (a sketch of this screening loop appears below). This is one "method"... what about other approaches? Some of them are time consuming, like excluding predictors from the model one at a time and watching for changes in the b-coefficients - they should be noticeably different.

Of course, we must always bear in mind the specific context/goal of the analysis... Sometimes the only remedy is to repeat the study, but right now I'm interested in the various ways of screening for redundant predictors when (multi)collinearity occurs in a regression model.
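
A minimal sketch of that R² screening loop (the data frame mydata and its predictor columns are hypothetical):

> X <- mydata[, c("x1", "x2", "x3")]     # hypothetical predictor columns
> r2 <- sapply(names(X), function(v)     # regress each predictor on the rest
+   summary(lm(reformulate(setdiff(names(X), v), response = v),
+              data = X))$r.squared)
> r2[r2 > 0.9]                           # predictors flagged as redundant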

Comments (5)

长梦不多时 2024-09-13 07:36:23

Since there has been no mention of VIF so far, I will add my answer. A Variance Inflation Factor > 10 usually indicates serious redundancy between predictor variables. The VIF is the factor by which the variance of a variable's coefficient is inflated compared to what it would be if that variable were not highly correlated with the other variables.

vif() is available in the car package and is applied to an object of class "lm". It returns the VIF of each of x1, x2, ..., xn in the fitted lm object. It is a good idea to exclude variables with VIF > 10, or to introduce transformations for the variables with VIF > 10.
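
A minimal usage sketch (the model formula and the data frame mydata are hypothetical):

> library(car)                           # provides vif()
> fit <- lm(y ~ x1 + x2 + x3, data = mydata)
> vif(fit)                               # one VIF per predictor; > 10 is suspect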

智商已欠费 2024-09-13 07:36:22

See also Section 9.4 of the book Practical Regression and Anova using R [Faraway 2002].

Collinearity can be detected in several ways:

  1. Examination of the correlation matrix of the predictors will reveal large pairwise collinearities.

  2. A regression of x_i on all other predictors gives R^2_i. Repeat for all predictors. R^2_i close to one indicates a problem — the offending linear combination may be found.

  3. Examine the eigenvalues of t(X) %*% X, where X denotes the model matrix; small eigenvalues indicate a problem. The 2-norm condition number can be shown to be the ratio of the largest to the smallest non-zero singular value of the matrix ($\kappa = \sqrt{\lambda_1/\lambda_p}$, where $\lambda_1$ and $\lambda_p$ are the largest and smallest eigenvalues of t(X) %*% X; see ?kappa); $\kappa \ge 30$ is considered large (see the sketch below).
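
A minimal sketch of item 3, assuming a hypothetical fitted lm object fit:

> X <- model.matrix(fit)                 # fit is a hypothetical fitted lm
> ev <- eigen(crossprod(X))$values       # eigenvalues of t(X) %*% X
> sqrt(max(ev) / min(ev))                # 2-norm condition number, kappa
> kappa(X, exact = TRUE)                 # same value, computed from the SVD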

画尸师 2024-09-13 07:36:20

You might like Vito Ricci's Reference Card "R Functions For Regression Analysis"
http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf

It succinctly lists many useful regression-related functions in R, including diagnostic functions.
In particular, it lists the vif function from the car package, which can assess multicollinearity.
http://en.wikipedia.org/wiki/Variance_inflation_factor

Consideration of multicollinearity often goes hand in hand with issues of assessing variable importance. If this applies to you, perhaps check out the relaimpo package: http://prof.beuth-hochschule.de/groemping/relaimpo/

草莓味的萝莉 2024-09-13 07:36:19

Just to add to what Dirk said about the Condition Number method, a rule of thumb is that values of CN > 30 indicate severe collinearity. Other methods, apart from condition number, include:

1) the determinant of the correlation matrix, which ranges from 0 (perfect collinearity) to 1 (no collinearity). The example below applies det() to cov() of the simulated predictors; since they have roughly unit variance, the covariance matrix behaves like the correlation matrix here.

# using Dirk's example
> det(cov(mm12[,-1]))
[1] 0.8856818
> det(cov(mm123[,-1]))
[1] 8.916092e-09

2) Using the fact that the determinant of a matrix is the product of its eigenvalues => the presence of one or more small eigenvalues indicates collinearity

> eigen(cov(mm12[,-1]))$values
[1] 1.0876357 0.8143184

> eigen(cov(mm123[,-1]))$values
[1] 5.388022e+00 9.862794e-01 1.677819e-09

3) The value of the Variance Inflation Factor (VIF). The VIF for predictor i is 1/(1 - R_i^2), where R_i^2 is the R^2 from a regression of predictor i against the remaining predictors. Collinearity is present when the VIF for at least one independent variable is large. Rule of thumb: VIF > 10 is of concern. For an implementation in R, see the vif() function in the car package (mentioned above); a sketch of the 1/(1 - R_i^2) computation follows below. I would also like to comment that the use of R^2 for determining collinearity should go hand in hand with visual examination of the scatterplots, because a single outlier can "cause" collinearity where it doesn't exist, or can hide collinearity where it does exist.
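
A minimal sketch of that formula, reusing the simulated x1, x2, x3 from Dirk's answer below:

> r2 <- summary(lm(x3 ~ x1 + x2))$r.squared  # R^2 of x3 on the other predictors
> 1/(1 - r2)                                 # VIF for x3: huge, since x3 ~ x1 + 2*x2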

赠意 2024-09-13 07:36:17

The kappa() function can help. Here is a simulated example:

> set.seed(42)
> x1 <- rnorm(100)
> x2 <- rnorm(100)
> x3 <- x1 + 2*x2 + rnorm(100)*0.0001    # so x3 approx a linear comb. of x1+x2
> mm12 <- model.matrix(~ x1 + x2)        # normal model, two indep. regressors
> mm123 <- model.matrix(~ x1 + x2 + x3)  # bad model with near collinearity
> kappa(mm12)                            # a 'low' kappa is good
[1] 1.166029
> kappa(mm123)                           # a 'high' kappa indicates trouble
[1] 121530.7

and we go further by making the third regressor more and more collinear:

> x4 <- x1 + 2*x2 + rnorm(100)*0.000001  # even more collinear
> mm124 <- model.matrix(~ x1 + x2 + x4)
> kappa(mm124)
[1] 13955982
> x5 <- x1 + 2*x2                        # now x5 is linear comb of x1,x2
> mm125 <- model.matrix(~ x1 + x2 + x5)
> kappa(mm125)
[1] 1.067568e+16

This uses an approximation; see help(kappa) for details.
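
By default kappa() returns a fast estimate; for the exact 2-norm condition number (computed from the singular values, at more cost), pass exact = TRUE:

> kappa(mm123, exact = TRUE)             # exact condition number of the bad model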
