I am trying to run a car::vif() test in R to check for multicollinearity. However, when I run the code
reg.model1 <- log(Price2) ~ Detached.house + Semi.detached.house +
Attached.houses + Apartment +
Stock.apartment + Housing.cooperative + Sole.owner + Age +
BRA + Bedrooms + Balcony + Lotsize + Sentrum + Alna + Vestre.Aker +
Nordstrand + Marka + Ullern + Østensjø + Søndre.Nordstrand + Stovner +
Nordre.Aker + Bjerke + Grorud + Gamle.Oslo + St..Hanshaugen +
Grünerløkka + Sagene + Frogner
reg1 <- lm(formula = reg.model1, data = Data)
vif(reg1)
I receive this error in the console:
Error in vif.default(reg1) : there are aliased coefficients in the model
What I have read is that this means something in the model is highly correlated. When I look at the correlation matrix, the only variable that is highly correlated with anything is the dependent variable Price. But I also read somewhere that a high correlation involving the dependent variable is not a problem. I also found that BRA has a correlation of 0.8, so I ran the model again without it and still got the same error. Does anyone know what the problem could be, or what I could try differently?
Comments (2)
This is telling you that some set (or sets) of predictors is perfectly (multi)collinear. If you looked at coef(reg1) you would see at least one NA value, and if you ran summary(reg1) you would see the message "(n not defined because of singularities)" next to the coefficient table (for some n >= 1). Examining the pairwise correlations of the predictor variables is not enough: with predictors A, B, C (for example), none of the pairwise correlations need be exactly 1 in absolute value, and the set can still be perfectly multicollinear. Probably the most common case is where A, B, C are dummy variables that describe a mutually exclusive and complete set of possibilities (i.e. for each observation exactly one of A, B, C is 1 and the other two are 0); the dummies then add up to the intercept column. I strongly suspect that this is what's going on with your last 16 or so variables, which seem to be boroughs of Oslo ...
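To make the point about pairwise correlations concrete, here is a minimal sketch on simulated data. The data frame `d`, the dummies `A`, `B`, `C` and the response `y` are made up for illustration and are not the poster's data. It shows that no pairwise correlation reaches 1 in absolute value, yet one coefficient comes back as NA.

```r
set.seed(1)
n <- 200

# Hypothetical data: three mutually exclusive, exhaustive dummies plus one numeric predictor
grp <- sample(c("A", "B", "C"), n, replace = TRUE)
d <- data.frame(
  A = as.integer(grp == "A"),
  B = as.integer(grp == "B"),
  C = as.integer(grp == "C"),
  x = rnorm(n)
)
d$y <- 1 + 0.5 * d$A - 0.3 * d$B + 0.2 * d$x + rnorm(n)

# Largest absolute pairwise correlation among the dummies (stays below 1)
cors <- cor(d[, c("A", "B", "C")])
max_abs_cor <- max(abs(cors[upper.tri(cors)]))
print(max_abs_cor)

# A + B + C equals the intercept column, so lm() cannot estimate all of them
fit <- lm(y ~ A + B + C + x, data = d)
na_coefs <- names(coef(fit))[is.na(coef(fit))]
print(na_coefs)

# alias() spells out the exact linear dependency
print(alias(fit))
```

In this sketch the NA lands on whichever dummy comes last in the formula, which is a consequence of how lm() handles rank deficiency rather than a sign that this particular variable is the "bad" one.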
Checking which coefficients of the regression are NA (as suggested by @Axeman) can suggest where the problem is; this answer explains how you can use model.matrix() and caret::findLinearCombos to figure out exactly which sets of predictors are causing the problem. (If all of your predictors are simple numeric variables you can skip model.matrix().)
If your problem is indeed caused by including a dummy variable for every possible geographic region, the simplest/best solution is to include geographic region (borough) in the model as a factor: if you do this, R will automatically generate a set of dummies/contrasts, but it will leave one dummy out automatically to avoid this kind of problem. If you later want to go back and get predicted values for every borough, you can use tools from the emmeans or effects packages.
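The two suggestions above (locating the dependent columns, then refitting with a factor) could look roughly like this. It is a sketch on simulated data: the data frame `d`, the three borough dummies and `logPrice` are stand-ins, not the poster's data. The rank check uses base R's qr(); the caret::findLinearCombos call is only run if caret happens to be installed.

```r
set.seed(2)
n <- 300

# Hypothetical stand-in for the borough dummies in the question
borough <- sample(c("Frogner", "Sagene", "Ullern"), n, replace = TRUE)
d <- data.frame(
  Frogner = as.integer(borough == "Frogner"),
  Sagene  = as.integer(borough == "Sagene"),
  Ullern  = as.integer(borough == "Ullern"),
  Age     = runif(n, 0, 100)
)
d$logPrice <- 14 + 0.3 * d$Frogner + 0.1 * d$Sagene - 0.002 * d$Age + rnorm(n, sd = 0.2)

# Design matrix, including the intercept column
X <- model.matrix(~ Frogner + Sagene + Ullern + Age, data = d)

# Rank deficiency: qr() pivots linearly dependent columns to the end
q <- qr(X)
rank_deficit <- ncol(X) - q$rank
dependent <- colnames(X)[q$pivot[seq_len(ncol(X)) > q$rank]]
print(dependent)

# Optional: the caret helper mentioned above, if the package is available
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::findLinearCombos(X))
}

# Rebuild a single factor from the dummy columns and let R choose the contrasts
dummy_cols <- c("Frogner", "Sagene", "Ullern")
d$Borough <- factor(dummy_cols[max.col(as.matrix(d[, dummy_cols]), ties.method = "first")])
fit <- lm(logPrice ~ Borough + Age, data = d)
print(coef(fit))
```

With the factor, the first level (here Frogner, alphabetically) becomes the reference category and the remaining coefficients are differences from it. Note that car::vif() reports generalized VIFs for factor terms.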
I searched around for solutions since I couldn't solve the problem based on the answers alone. The answers did, however, help me understand my problem better. The solution was as simple as putting a minus instead of a plus in front of one of the dummy variables in each series. My original code was the formula posted in the question.
To solve my issue I simply had to change that formula.
I have 3 series of dummies, and to make sure perfect multicollinearity doesn't occur I have to remove one dummy from each series. I removed Apartment for the type of home, Sole.owner for the type of ownership, and Frogner for the district. This website explained the problem and the solution much better and more simply than I can (https://www.learndatasci.com/glossary/dummy-variable-trap/)!
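The changed formula itself is not shown here, so the following is only an illustrative sketch of the idea on simulated data, not the poster's actual code. The data frame `d` and its values are made up, and only two dummy series (home type and district) are included to keep it short. It relies on the fact that `-` in an R formula removes a term.

```r
set.seed(3)
n <- 250

# Hypothetical data with two dummy series (home type and district)
home <- sample(c("Detached", "Apartment"), n, replace = TRUE)
boro <- sample(c("Frogner", "Sagene", "Ullern"), n, replace = TRUE)
d <- data.frame(
  Detached  = as.integer(home == "Detached"),
  Apartment = as.integer(home == "Apartment"),
  Frogner   = as.integer(boro == "Frogner"),
  Sagene    = as.integer(boro == "Sagene"),
  Ullern    = as.integer(boro == "Ullern"),
  BRA       = runif(n, 30, 200)
)
d$Price2 <- exp(14 + 0.2 * d$Detached + 0.3 * d$Frogner + 0.005 * d$BRA + rnorm(n, sd = 0.2))

# Every dummy included: each series adds up to the intercept, so one coefficient per series is aliased
full <- lm(log(Price2) ~ Detached + Apartment + Frogner + Sagene + Ullern + BRA, data = d)
n_aliased_full <- sum(is.na(coef(full)))

# "-" removes a term, so Apartment and Frogner are left out of this model
f_reduced <- log(Price2) ~ Detached - Apartment + Sagene + Ullern - Frogner + BRA
reduced_terms <- attr(terms(f_reduced), "term.labels")
reduced <- lm(f_reduced, data = d)
n_aliased_reduced <- sum(is.na(coef(reduced)))
print(reduced_terms)

# vif() should now run, assuming the car package is installed
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(reduced))
}
```

Writing `- Apartment` has the same effect as simply not listing Apartment at all; the omitted dummy in each series becomes the reference category against which the other coefficients are interpreted.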