`lm` summary does not display all factor levels

Posted on 2025-02-12 11:16:50

I am running a linear regression on a number of attributes including two categorical attributes, B and F, and I don't get a coefficient value for every factor level I have.

B has 9 levels and F has 6 levels. When I initially ran the model (with an intercept), I got 8 coefficients for B and 5 for F, which I understood as the first level of each being absorbed into the intercept.

I want to rank the levels within B and F based on their coefficients, so I added -1 after each factor to lock the intercept at 0 so that I could get coefficients for all levels.

Call:
lm(formula = dependent ~ a + B-1 + c + d + e + F-1 + g + h, data = input)

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
a     2.082e+03  1.026e+02  20.302  < 2e-16 ***
B1   -1.660e+04  9.747e+02 -17.027  < 2e-16 ***
B2   -1.681e+04  9.379e+02 -17.920  < 2e-16 ***
B3   -1.653e+04  9.254e+02 -17.858  < 2e-16 ***
B4   -1.765e+04  9.697e+02 -18.202  < 2e-16 ***
B5   -1.535e+04  1.388e+03 -11.059  < 2e-16 ***
B6   -1.677e+04  9.891e+02 -16.954  < 2e-16 ***
B7   -1.644e+04  9.694e+02 -16.961  < 2e-16 ***
B8   -1.931e+04  9.899e+02 -19.512  < 2e-16 ***
B9   -1.722e+04  9.071e+02 -18.980  < 2e-16 ***
c    -6.928e-01  6.977e-01  -0.993 0.321272    
d    -3.288e-01  2.613e+00  -0.126 0.899933    
e    -8.384e-01  1.171e+00  -0.716 0.474396    
F2    4.679e+02  2.176e+02   2.150 0.032146 *  
F3    7.753e+02  2.035e+02   3.810 0.000159 ***
F4    1.885e+02  1.689e+02   1.116 0.265046    
F5    5.194e+02  2.264e+02   2.295 0.022246 *  
F6    1.365e+03  2.334e+02   5.848 9.94e-09 ***
g     4.278e+00  7.350e+00   0.582 0.560847    
h     2.717e-02  5.100e-03   5.328 1.62e-07 ***

This worked in part: all levels of B are now displayed, but F1 is still missing. As there is no longer an intercept, I am confused about why F1 is not in the linear model.

Switching the order of the call so that + F - 1 precedes + B - 1 results in coefficients of all levels of F being visible but not B1.

Does anybody know either how to display all levels of both B and F, or how to assess the relative weight of F1 compared to other levels of F from the outputs I have?

小瓶盖 2025-02-19 11:16:50

This issue is raised over and over again, but unfortunately there is no satisfying answer that could serve as an appropriate duplicate target. Looks like I need to write one.

Most people know this is related to "contrasts", but not everyone knows why they are needed or how to interpret the result. We have to look at the model matrix in order to fully digest this.

Suppose we are interested in a model with two factors: ~ f + g (numerical covariates do not matter, so I include none of them; the response does not appear in the model matrix, so I drop it, too). Consider the following reproducible example:

set.seed(0)

f <- sample(gl(3, 4, labels = letters[1:3]))
# [1] c a a b b a c b c b a c
#Levels: a b c

g <- sample(gl(3, 4, labels = LETTERS[1:3]))
# [1] A B A B C B C A C C A B
#Levels: A B C

We start with a model matrix with no contrasts at all:

X0 <- model.matrix(~ f + g, contrasts.arg = list(
                   f = contr.treatment(n = 3, contrasts = FALSE),
                   g = contr.treatment(n = 3, contrasts = FALSE)))

#   (Intercept) f1 f2 f3 g1 g2 g3
#1            1  0  0  1  1  0  0
#2            1  1  0  0  0  1  0
#3            1  1  0  0  1  0  0
#4            1  0  1  0  0  1  0
#5            1  0  1  0  0  0  1
#6            1  1  0  0  0  1  0
#7            1  0  0  1  0  0  1
#8            1  0  1  0  1  0  0
#9            1  0  0  1  0  0  1
#10           1  0  1  0  0  0  1
#11           1  1  0  0  1  0  0
#12           1  0  0  1  0  1  0

Note, we have:

unname( rowSums(X0[, c("f1", "f2", "f3")]) )
# [1] 1 1 1 1 1 1 1 1 1 1 1 1

unname( rowSums(X0[, c("g1", "g2", "g3")]) ) 
# [1] 1 1 1 1 1 1 1 1 1 1 1 1

So f1 + f2 + f3 = g1 + g2 + g3 = (Intercept), that is, the intercept column lies in both span{f1, f2, f3} and span{g1, g2, g3}. In this full specification, 2 columns are not identifiable. X0 will have column rank 1 + 3 + 3 - 2 = 5:

qr(X0)$rank
# [1] 5

So, if we fit a linear model with this X0, 2 coefficients out of 7 parameters will be NA:

y <- rnorm(12)  ## random `y` as a response
lm(y ~ X0 - 1)  ## drop intercept as `X0` has intercept already

#X0(Intercept)           X0f1           X0f2           X0f3           X0g1  
#      0.32118        0.05039       -0.22184             NA       -0.92868  
#         X0g2           X0g3  
#     -0.48809             NA

What this really implies is that we have to add 2 linear constraints on the 7 parameters in order to get a full rank model. It does not really matter what these 2 constraints are, but there must be 2 linearly independent constraints. For example, we can do any of the following:

drop any 2 columns from X0;
add two sum-to-zero constraints on the parameters, for example requiring the coefficients for f1, f2 and f3 to sum to 0, and the same for g1, g2 and g3;
use regularization, for example, adding a ridge penalty to f and g.

Note, these three ways end up with three different solutions:

contrasts;
constrained least squares;
linear mixed models or penalized least squares.

The first two are still within the scope of fixed effect modelling. With "contrasts", we reduce the number of parameters until we get a full rank model matrix, while the other two do not reduce the number of parameters but reduce the effective degrees of freedom.
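As a rough illustration of the regularization route, here is a minimal sketch that solves the ridge-penalized normal equations directly in base R. The factor values are typed in from the printout above; the penalty strength `lambda`, the seed and the choice to leave the intercept unpenalized are my own assumptions for the illustration, not something the question prescribes.

```r
# Factor values copied from the example above (typed in explicitly)
f <- factor(c("c","a","a","b","b","a","c","b","c","b","a","c"))
g <- factor(c("A","B","A","B","C","B","C","A","C","C","A","B"))
set.seed(1)
y <- rnorm(12)                     # arbitrary response, for illustration only

# Full indicator coding: 7 columns, but only rank 5
X0 <- model.matrix(~ f + g, contrasts.arg = list(
                   f = contr.treatment(n = 3, contrasts = FALSE),
                   g = contr.treatment(n = 3, contrasts = FALSE)))

lambda <- 0.1                      # hypothetical penalty strength
D <- diag(c(0, rep(1, 6)))         # penalize factor columns, not the intercept

# Penalized normal equations: (X'X + lambda * D) beta = X'y
beta_ridge <- drop(solve(crossprod(X0) + lambda * D, crossprod(X0, y)))
beta_ridge                         # all 7 coefficients, none of them NA
```

Every level gets a coefficient here because the penalty makes the system nonsingular. With the intercept left unpenalized, the estimates within each factor also sum to zero, so they can be read as deviations around a common level, shrunk by an amount that depends on `lambda`.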

Now, you are certainly after the "contrasts" way. So, remember, we have to drop 2 columns. They can be

one column from f and one column from g, giving the model ~ f + g, with both f and g contrasted;
the intercept, and one column from either f or g, giving the model ~ f + g - 1.

It should now be clear that, within the framework of dropping columns, there is no way you can get what you want, because you are expecting to drop only 1 column. The resulting model matrix would still be rank-deficient.
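Both options can be checked directly on the toy data. The sketch below types the factor values in from the printout above, so it does not depend on the random number generator; the names `X1` and `X2` are just labels I picked.

```r
# Factor values copied from the example above (typed in explicitly)
f <- factor(c("c","a","a","b","b","a","c","b","c","b","a","c"))
g <- factor(c("A","B","A","B","C","B","C","A","C","C","A","B"))

X1 <- model.matrix(~ f + g)      # intercept kept; first level of f and of g dropped
X2 <- model.matrix(~ f + g - 1)  # intercept dropped; f keeps all levels, g still loses one

colnames(X1)
# [1] "(Intercept)" "fb" "fc" "gB" "gC"
colnames(X2)
# [1] "fa" "fb" "fc" "gB" "gC"

rank1 <- qr(X1)$rank
rank2 <- qr(X2)$rank
c(ncol(X1), rank1, ncol(X2), rank2)
# [1] 5 5 5 5
```

Either way we end up with 5 columns of full rank. This is the same pattern as in the question's output: once the intercept is removed, the first factor in the formula keeps all its levels, while the second is still coded relative to its first level, so F2 to F6 are estimated differences from F1, whose effect is fixed at 0.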

If you really want to have all coefficients there, use constrained least squares, or penalized regression / linear mixed models.
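For the sum-to-zero flavour of constrained least squares, one convenient route (a sketch, not the only way) is to fit with `contr.sum` coding and reconstruct the last level of each factor as minus the sum of the others. The response `y`, the seed and the helper names `eff_f` and `eff_g` are assumptions made for this illustration.

```r
# Factor values copied from the example above (typed in explicitly)
f <- factor(c("c","a","a","b","b","a","c","b","c","b","a","c"))
g <- factor(c("A","B","A","B","C","B","C","A","C","C","A","B"))
set.seed(1)
y <- rnorm(12)                     # arbitrary response, for illustration only

# Sum-to-zero coding: coefficients are named (Intercept), f1, f2, g1, g2
fit <- lm(y ~ f + g, contrasts = list(f = "contr.sum", g = "contr.sum"))
b <- coef(fit)

# The effect of the last level is minus the sum of the estimated ones
eff_f <- unname(c(b[c("f1", "f2")], -sum(b[c("f1", "f2")])))
names(eff_f) <- levels(f)
eff_g <- unname(c(b[c("g1", "g2")], -sum(b[c("g1", "g2")])))
names(eff_g) <- levels(g)

sort(eff_f, decreasing = TRUE)     # levels of f ranked by effect
sort(eff_g, decreasing = TRUE)     # levels of g ranked by effect
```

This gives one number per level for each factor, expressed as a deviation from the overall mean level, which is usually what is wanted when ranking levels. The fitted values are the same as under the default treatment coding; only the parametrization changes.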

When there are interactions between factors, things are more complicated, but the idea is still the same. Given that my answer is already long enough, I will not go further here.
