Comparing all factor levels to the grand mean: can I tweak contrasts in a linear model fit to show all levels?

Posted 2025-02-12 01:51:23 · 1,650 words · 2 views · 0 comments


I am trying to tweak contrast coding on a linear model where I want to know if each level of a factor is significantly different from the grand mean.

Let’s say the factor has levels "A", "B", and "C". The most common control-treatment contrasts obviously set "A" as the reference level and compare "B" and "C" to it. This is not what I want, because level "A" does not show up in the model summary.

Deviation coding also doesn’t seem to give me what I want, since it assigns level "C" the contrast row [-1, -1], and now this level does not show up in the model summary.

set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))
fit <- lm(y ~ x, contrasts = list(x = contr.sum))
summary(fit)

In addition, the reported level names have changed from "A" and "B" to "1" and "2".

Call:
lm(formula = y ~ x, contrasts = list(x = contr.sum))

Residuals:
     1      2      3      4      5      6 
-0.405  0.405 -1.215  1.215  0.575 -0.575 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02902    0.46809  -0.062    0.954
x1          -0.19239    0.66198  -0.291    0.790
x2           0.40885    0.66198   0.618    0.581

Residual standard error: 1.147 on 3 degrees of freedom
Multiple R-squared:  0.1129,    Adjusted R-squared:  -0.4785 
F-statistic: 0.1909 on 2 and 3 DF,  p-value: 0.8355

Am I missing something? Should I add a dummy variable that is equal to the grand mean, so that I can use this as the reference level?


I saw a similar question (but maybe more demanding) asked last year, but without solution (yet): models with 'differences from mean' for all coefficients on categorical variables; get 'contrast coding' to do it?.


The accepted answer here works, but the author has not provided an explanation. I have asked about it on the stats SE: https://stats.stackexchange.com/questions/600798/understanding-the-process-of-tweaking-contrasts-in-linear-model-fitting-to-show


情绪失控 2025-02-19 01:51:24


Contrasts are linear combinations of the model parameters. Calculating them requires a bit of linear algebra. You can do it by hand, as demonstrated in the other answers, or you can use one of several packages to do the math for you.

I'll use the contrast package. Read its vignette here.

library("contrast")

set.seed(1)

xlevels <- LETTERS[1:3]
y <- rnorm(6, 0, 1)
x <- factor(rep(xlevels, each = 2))

NB: you don't need to change the parametrization of the model (via the contrasts argument of lm). You can compute any contrast of interest under any parametrization, so why not keep the default one?

Fit a linear regression with the default parametrization.

fit <- lm(y ~ x)

The default parametrization has A as the reference level.

broom::tidy(fit)
#> # A tibble: 3 × 5
#>   term        estimate std.error statistic p.value
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept)  -0.221      0.811   -0.273    0.803
#> 2 xB            0.601      1.15     0.524    0.636
#> 3 xC           -0.0241     1.15    -0.0210   0.985

Compare each level in turn to the average of the three levels.

for (xi in xlevels) {
  print(
    paste("Contrast", xi, "with the average")
  )
  print(
    contrast(
      fit,
      list(x = xi),
      list(x = xlevels),
      type = "average"
    )
  )
}
#> [1] "Contrast A with the average"
#> lm model parameter contrast
#> 
#>     Contrast      S.E.     Lower    Upper     t df Pr(>|t|)
#> 1 -0.1923854 0.6619845 -2.299116 1.914345 -0.29  3   0.7903
#> [1] "Contrast B with the average"
#> lm model parameter contrast
#> 
#>    Contrast      S.E.     Lower    Upper    t df Pr(>|t|)
#> 1 0.4088459 0.6619845 -1.697884 2.515576 0.62  3   0.5805
#> [1] "Contrast C with the average"
#> lm model parameter contrast
#> 
#>     Contrast      S.E.     Lower   Upper     t df Pr(>|t|)
#> 1 -0.2164605 0.6619845 -2.323191 1.89027 -0.33  3   0.7652

Created on 2023-01-07 with reprex v2.0.2
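For readers who want to see the linear algebra the contrast package is doing, here is a base-R cross-check of the first contrast above (a sketch, reusing the question's seed and data):

```r
# Under the default treatment coding, mu_A = b0, mu_B = b0 + b1, mu_C = b0 + b2,
# so mu_A - (mu_A + mu_B + mu_C)/3 = -(b1 + b2)/3, i.e. L = c(0, -1/3, -1/3).
set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))
fit <- lm(y ~ x)

L <- c(0, -1/3, -1/3)
est <- drop(L %*% coef(fit))                   # -0.1923854, as in the output above
se  <- sqrt(drop(t(L) %*% vcov(fit) %*% L))    #  0.6619845, as in the output above
tval <- est / se
pval <- 2 * pt(abs(tval), df = fit$df.residual, lower.tail = FALSE)
round(c(estimate = est, se = se, t = tval, p = pval), 4)
```

The same recipe (estimate = L %*% beta, variance = L' V L) reproduces the "B" and "C" rows with the appropriate L vectors.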

小兔几 2025-02-19 01:51:24


What about just doing a series of t-tests, each one comparing the mean of the subset to the grand mean?

lapply(levels(x), \(i) t.test(y,y[x==i]))
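Applied to the question's data, the one-liner looks like this (a base-R sketch; note that y and y[x == i] overlap, so the two samples in each Welch test are not independent):

```r
# Run one t-test per factor level, comparing that level's subset
# against the full sample, and collect the p-values.
set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))

tests <- lapply(levels(x), \(i) t.test(y, y[x == i]))
names(tests) <- levels(x)
sapply(tests, function(tt) tt$p.value)
```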
无所谓啦 2025-02-19 01:51:23


This answer shows you how to obtain the following coefficient table:

#               Estimate Std. Error     t value  Pr(>|t|)
#(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655
#A           -0.19238543  0.6619845 -0.29061922 0.7902750
#B            0.40884591  0.6619845  0.61760645 0.5805485
#C           -0.21646049  0.6619845 -0.32698723 0.7651640

Amazing, isn't it? It mimics what you see from summary(fit), i.e.,

#               Estimate Std. Error     t value  Pr(>|t|)
#(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655
#x1          -0.19238543  0.6619845 -0.29061922 0.7902750
#x2           0.40884591  0.6619845  0.61760645 0.5805485

But now we have all factor levels displayed.


Why lm summary does not display all factor levels?

In 2016, I answered this Stack Overflow question: `lm` summary not display all factor levels. Since then, it has become the target for marking duplicate questions on similar topics.

To recap, the basic idea is that in order to have a full-rank design matrix for least squares fitting, we must apply contrasts to a factor variable. Let's say that the factor has N levels, then no matter what type of contrasts we choose (see ?contrasts for a list), it reduces the raw N dummy variables to a new set of N - 1 variables. Therefore, only N - 1 coefficients are associated with an N-level factor.
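A minimal base-R illustration of this rank argument: under sum-to-zero contrasts, a 3-level factor contributes only 3 - 1 = 2 columns to the design matrix, plus the intercept.

```r
# The design matrix has 3 columns ("(Intercept)", "x1", "x2"),
# not 1 + 3 = 4: one raw dummy has been absorbed by the contrasts.
x <- factor(rep(LETTERS[1:3], each = 2))
mm <- model.matrix(~ x, contrasts.arg = list(x = contr.sum))
colnames(mm)  # "(Intercept)" "x1" "x2"
ncol(mm)      # 3
```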

However, we can transform the N - 1 coefficients back to the original N coefficients using the contrasts matrix. The transformation enables us to obtain a coefficient table for all factor levels. I will now demonstrate how to do this, based on OP's reproducible example:

set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))
fit <- lm(y ~ x, contrasts = list(x = contr.sum))

In this example, the sum-to-zero contrast is applied to factor x. To know more on how to control contrasts for model fitting, see my answer at How to set contrasts for my variable in regression analysis with R?.


R code walk-through

For a factor variable of N levels subject to sum-to-zero contrasts, we can use the following function to get the N x (N - 1) transformation matrix that maps the (N - 1) coefficients estimated by lm back to the N coefficients for all levels.

ContrSumMat <- function (fctr, sparse = FALSE) {
  if (!is.factor(fctr)) stop("'fctr' is not a factor variable!")
  N <- nlevels(fctr)
  Cmat <- contr.sum(N, sparse = sparse)
  dimnames(Cmat) <- list(levels(fctr), seq_len(N - 1))
  Cmat
}

For the example 3-level factor x, this matrix is:

Cmat <- ContrSumMat(x)
#   1  2
#A  1  0
#B  0  1
#C -1 -1

The fitted model fit reports 3 - 1 = 2 coefficients for this factor. We can extract them as:

## coefficients After Contrasts
coef_ac <- coef(fit)[2:3]
#        x1         x2 
#-0.1923854  0.4088459 

Therefore, the level-specific coefficients are:

## coefficients Before Contrasts
coef_bc <- (Cmat %*% coef_ac)[, 1]
#         A          B          C 
#-0.1923854  0.4088459 -0.2164605 

## note that they sum to zero as expected
sum(coef_bc)
#[1] 0

Similarly, we can get the covariance matrix after contrasts:

var_ac <- vcov(fit)[2:3, 2:3]
#           x1         x2
#x1  0.4382235 -0.2191118
#x2 -0.2191118  0.4382235

and transform it to the one before contrasts:

var_bc <- Cmat %*% var_ac %*% t(Cmat)
#           A          B          C
#A  0.4382235 -0.2191118 -0.2191118
#B -0.2191118  0.4382235 -0.2191118
#C -0.2191118 -0.2191118  0.4382235

Interpretation:

  • The model intercept coef(fit)[1] is the grand mean.

  • The computed coef_bc is the deviation of each level's mean from the grand mean.

  • The diagonal entries of var_bc give the estimated variances of these deviations.

We can then proceed to compute t-statistics and p-values for these coefficients, as follows.

## standard error of point estimate `coef_bc`
std.error_bc <- sqrt(diag(var_bc))
#        A         B         C 
#0.6619845 0.6619845 0.6619845 

## t-statistics (Null Hypothesis: coef_bc = 0)
t.stats_bc <- coef_bc / std.error_bc
#         A          B          C 
#-0.2906192  0.6176065 -0.3269872 

## p-values of the t-statistics
p.value_bc <- 2 * pt(abs(t.stats_bc), df = fit$df.residual, lower.tail = FALSE)
#        A         B         C 
#0.7902750 0.5805485 0.7651640 

## construct a coefficient table that mimics `coef(summary(fit))`
stats.tab_bc <- cbind("Estimate" = coef_bc,
                      "Std. Error" = std.error_bc,
                      "t value" = t.stats_bc,
                      "Pr(>|t|)" = p.value_bc)
#    Estimate Std. Error    t value  Pr(>|t|)
#A -0.1923854  0.6619845 -0.2906192 0.7902750
#B  0.4088459  0.6619845  0.6176065 0.5805485
#C -0.2164605  0.6619845 -0.3269872 0.7651640

We can also augment it by including the result for the grand mean (i.e., the model intercept).

## extract statistics of the intercept
intercept.stats <- coef(summary(fit))[1, , drop = FALSE]
#               Estimate Std. Error     t value  Pr(>|t|)
#(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655

## augment the coefficient table
stats.tab <- rbind(intercept.stats, stats.tab_bc)
#               Estimate Std. Error     t value  Pr(>|t|)
#(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655
#A           -0.19238543  0.6619845 -0.29061922 0.7902750
#B            0.40884591  0.6619845  0.61760645 0.5805485
#C           -0.21646049  0.6619845 -0.32698723 0.7651640

We can also print this table with significance stars.

printCoefmat(stats.tab)
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02902    0.46809 -0.0620   0.9545
#A           -0.19239    0.66199 -0.2906   0.7903
#B            0.40885    0.66199  0.6176   0.5805
#C           -0.21646    0.66199 -0.3270   0.7652

Emm? Why are there no stars? Well, in this example all p-values are very large. The stars will show up if p-values are small. Here is a convincing demo:

fake.tab <- stats.tab
fake.tab[, 4] <- fake.tab[, 4] / 100
printCoefmat(fake.tab)
#            Estimate Std. Error t value Pr(>|t|)   
#(Intercept) -0.02902    0.46809 -0.0620 0.009545 **
#A           -0.19239    0.66199 -0.2906 0.007903 **
#B            0.40885    0.66199  0.6176 0.005805 **
#C           -0.21646    0.66199 -0.3270 0.007652 **
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Oh, this is so beautiful. For the meaning of these stars, see my answer at: Interpreting R significance codes for ANOVA table?


Closing Remarks

It should be possible to write a function (or even an R package) to perform this kind of table transformation. However, it might take great effort to make such a function flexible enough to handle:

  • all types of contrasts (this is easy to do);

  • complicated model terms, like interactions between a factor and other numeric/factor variables (this seems really involved!).

So, I will stop here for the moment.


Miscellaneous Replies

Are the model scores that I get from lm's summary still accurate, even though it isn't displaying all levels of the factor?

Yes, they are. lm conducts accurate least squares fitting.

In addition, the transformation of the coefficient table does not affect the R-squared, degrees of freedom, residuals, fitted values, F-statistic, ANOVA table, etc.
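As a final sanity check (a base-R sketch), in this balanced design the transformed estimates should simply equal each group's mean minus the unweighted grand mean of the group means:

```r
# Per-level deviations from the grand mean, computed directly from the data.
set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))

group_means <- tapply(y, x, mean)              # per-level sample means
deviations  <- group_means - mean(group_means) # deviations from grand mean
round(deviations, 7)
#          A          B          C 
# -0.1923854  0.4088459 -0.2164605
```

These agree with the transformed coefficient estimates above, as expected for a balanced one-way design.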

白首有我共你 2025-02-19 01:51:23


This is merely a preliminary, function-ized version of @Zheyuan's amazing answer. As @Zheyuan also mentions, this answer has limitations, including complicated models with other contrast types, interactions, etc. I don't believe the function works if the factor is ordered.

The test data:

set.seed(1)
test.df <- data.frame(y = rnorm(10, 0, 1),
                     x = factor(rep(LETTERS[1:5], each = 2)))
test.fit <- lm(y ~ x, contrasts = list(x=contr.sum), data= test.df)

The functions to return the level stats:

# Output deviation coded factor matrix, used within `DevContrStats()`
ContrSumMat <- function (fctr, sparse = FALSE) {
  if (!is.factor(fctr)) stop("'fctr' is not a factor variable!")
  N <- nlevels(fctr)
  Cmat <- contr.sum(N, sparse = sparse)
  dimnames(Cmat) <- list(levels(fctr), seq_len(N - 1))
  Cmat
}

## `dat` is the original data
## `fit.model` is the linear model created from the data
## `fctr_col` is the column you'd like to get all factor levels from
DevContrStats <- function(dat, fit.model, fctr_col) {
  
  fctr.mtx <- ContrSumMat(factor(dat[, fctr_col]))
  
  ## coefficients After Contrasts
  coef_a_contr <- coef(fit.model)[-1]

  ## coefficients Before Contrasts
  ## coef_bc tells how each factor level deviates from the grand mean.
  coef_b_contr <- (fctr.mtx %*% coef_a_contr)[, 1]

  ## Covariance matrix after contrasts:
  var_a_contr <- vcov(fit.model)[-1,-1]

  ## Transform covariance matrix (after contrasts) to the one before contrasts:
  ## The diagonal of var_bc gives the estimated variance of factor-level deviations.
  var_b_contr <- fctr.mtx %*% var_a_contr %*% t(fctr.mtx)

  ## standard error of point estimate `coef_bc`
  std.err_b_contr <- sqrt(diag(var_b_contr))

  ## t-statistics (Null Hypothesis: coef_bc = 0)
  t.stats_b_contr <- coef_b_contr / std.err_b_contr

  ## p-values of the t-statistics
  p.value_b_contr <- 2 * pt(abs(t.stats_b_contr),
                            df = fit.model$df.residual,
                            lower.tail = FALSE)

  ## construct a coefficient table that mimics `coef(summary(fit))`
  stats.tab_b_contr <- cbind("Estimate" = coef_b_contr,
                        "Std. Error" = std.err_b_contr,
                        "t value" = t.stats_b_contr,
                        "Pr(>|t|)" = p.value_b_contr)

  ## extract statistics of the intercept, which = grand mean
  intercept.stats <- coef(summary(fit.model))[1, , drop = FALSE]

  ## augment the coefficient table with the intercept stats
  stats.tab <- rbind(intercept.stats, stats.tab_b_contr)

  ## print stats table with significance stars
  printCoefmat(stats.tab)
}


17:20:20> DevContrStats(test.df, test.fit, 'x')
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.164183   0.245068  0.6699  0.53258  
A            0.448694   0.490137  0.9154  0.40195  
B           -0.028986   0.490137 -0.0591  0.95513  
C            0.786629   0.490137  1.6049  0.16942  
D           -1.582153   0.490137 -3.2280  0.02326 *
E            0.375816   0.490137  0.7668  0.47785  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
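A quick sketch of the ordered-factor caveat mentioned above: by default R applies polynomial contrasts (contr.poly), not contr.sum, to ordered factors, so the ContrSumMat mapping would not match the fitted coefficients unless contr.sum is requested explicitly in lm().

```r
# An ordered factor gets polynomial contrast columns ("xo.L", "xo.Q")
# rather than sum-to-zero columns ("xo1", "xo2") by default.
xo <- factor(rep(LETTERS[1:3], each = 2), ordered = TRUE)
colnames(model.matrix(~ xo))  # "(Intercept)" "xo.L" "xo.Q"
```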