How to compute a minimal but fast linear regression on each column of a response matrix?



I want to compute ordinary least squares (OLS) estimates in R without using "lm", for several reasons. First, "lm" also computes a lot of things I don't need (such as the fitted values), which matters since data size is an issue in my case. Second, I want to be able to implement OLS myself in R before doing it in another language (e.g. in C with the GSL).

As you may know, the model is: Y = Xb + E, with E ~ N(0, sigma^2). As detailed below, b is a vector with 2 parameters: the intercept (b0) and another coefficient (b1). In the end, for each linear regression I run, I want the estimate of b1 (the effect size), its standard error, the estimate of sigma^2 (the residual variance), and R^2 (the coefficient of determination).
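Spelled out, these are the standard closed-form OLS quantities (the same formulas my custom function below implements):

$$\hat{b} = (X^\top X)^{-1} X^\top Y, \quad \hat{\sigma}^2 = \frac{\hat{e}^\top \hat{e}}{n - r}, \quad \widehat{\mathrm{Var}}(\hat{b}) = \hat{\sigma}^2 (X^\top X)^{-1}, \quad R^2 = 1 - \frac{\hat{e}^\top \hat{e}}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$$

where $\hat{e} = Y - X\hat{b}$ is the residual vector, $n$ is the number of samples, and $r = 2$ is the rank of the design matrix $X = [\mathbf{1} \;\; x_i]$.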

Here are my data. I have N samples (e.g. individuals, N ~= 100). For each sample, I have Y outputs (response variables, Y ~= 10^3) and X points (explanatory variables, X ~= 10^6). I want to treat the Y outputs separately, i.e. I want to launch Y linear regressions: one for output 1, one for output 2, etc. Moreover, I want to use the explanatory variables one by one: for output 1, I want to regress it on point 1, then on point 2, then ... and finally on point X. (I hope this is clear...!)

Here is my R code to check the speed of "lm" versus computing OLS estimates myself via matrix algebra.

First, I simulate dummy data:

nb.samples <-  10  # N
nb.points <- 1000  # X
# explanatory variables: points x samples, values sampled from {0,1,2}
x <- matrix(data=replicate(nb.samples, sample(x=0:2, size=nb.points, replace=TRUE)),
            nrow=nb.points, ncol=nb.samples, byrow=FALSE,
            dimnames=list(points=paste("p",seq(1,nb.points),sep=""),
              samples=paste("s",seq(1,nb.samples),sep="")))
nb.outputs <- 10  # Y
# response variables: samples x outputs, standard-normal noise
y <- matrix(data=replicate(nb.outputs, rnorm(nb.samples)),
            nrow=nb.samples, ncol=nb.outputs, byrow=TRUE,
            dimnames=list(samples=paste("s",seq(1,nb.samples),sep=""),
              outputs=paste("out",seq(1,nb.outputs),sep="")))

Here is my own function, which is used just below:

GetResFromCustomLinReg <- function(Y, xi){ # both Y and xi are N-dim vectors
  n <- length(Y)
  X <- cbind(rep(1,n), xi)  # design matrix: intercept column + explanatory variable
  p <- 1      # number of explanatory variables, besides the intercept
  r <- p + 1  # rank of X: number of independent explanatory variables
  inv.XtX <- solve(t(X) %*% X)           # (X'X)^-1
  beta.hat <- inv.XtX %*% t(X) %*% Y     # OLS estimates of b0 and b1
  Y.hat <- X %*% beta.hat                # fitted values
  E.hat <- Y - Y.hat                     # residuals
  E2.hat <- (t(E.hat) %*% E.hat)         # residual sum of squares
  sigma2.hat <- (E2.hat / (n - r))[1,1]  # residual variance estimate
  var.covar.beta.hat <- sigma2.hat * inv.XtX
  se.beta.hat <- sqrt(diag(var.covar.beta.hat))  # standard errors of b0.hat, b1.hat
  Y.bar <- mean(Y)
  R2 <- 1 - (E2.hat) / (t(Y-Y.bar) %*% (Y-Y.bar))  # coefficient of determination
  return(c(beta.hat[2], se.beta.hat[2], sigma2.hat, R2))
}
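(As a side note: the same normal-equations math can be written with crossprod(), which computes the cross-products in one pass instead of forming explicit transposes; a small, behaviour-equivalent sketch of just that part:)

# crossprod(X) == t(X) %*% X and crossprod(X, Y) == t(X) %*% Y
XtX <- crossprod(X)
inv.XtX <- solve(XtX)                    # explicit inverse still needed for the standard errors
beta.hat <- inv.XtX %*% crossprod(X, Y)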

Here is my code using the built-in "lm":

res.bi.all <- apply(x, 1, function(xi){lm(y ~ xi)})

Here is my custom OLS code:

res.cm.all <- apply(x, 1, function(xi){apply(y, 2, GetResFromCustomLinReg, xi)})

When I run this example with the values given above, I get:

> system.time( res.bi.all <- apply(x, 1, function(xi){lm(y ~ xi)}) )
   user  system elapsed
  2.526   0.000   2.528
> system.time( res.cm.all <- apply(x, 1, function(xi){apply(y, 2, GetResFromCustomLinReg, xi)}) )
   user  system elapsed
  4.561   0.000   4.561

(And, naturally, it gets worse when increasing N, X and Y.)

Of course, "lm" has the nice property of "automatically" fitting separately each column of the response matrix (y~xi), while I have to use "apply" Y times (for each yi~xi). But is this the only reason why my code is slower? Does one of you know how to improve this?

(Sorry for such a long question, but I really tried to provide a minimal, yet comprehensive example.)

> sessionInfo()
R version 2.12.2 (2011-02-25)
Platform: x86_64-redhat-linux-gnu (64-bit)


Comments (2)

戴着白色围巾的女孩 2024-11-20 08:01:46


Have a look at the fastLm() function in the RcppArmadillo package on CRAN. There is also a similar fastLm() in RcppGSL, which preceded it -- but you probably want the Armadillo-based solution. I have some slides in older presentations (on HPC with R) that show the speed gains.

Also note the hint in the help page about 'pivoted' approaches, which are better than the straight inverse of X'X and can matter with degenerate model matrices.
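For illustration, a minimal call with the RcppArmadillo interface might look like the sketch below (written against the current API; the signature may have differed in older versions):

library(RcppArmadillo)
X <- cbind(1, x[1, ])        # design matrix for point 1
fit <- fastLm(X, y[, 1])     # default method: fastLm(X, y)
fit$coefficients             # intercept and slope estimates
fit$stderr                   # their standard errors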

简单爱 2024-11-20 08:01:46


Following Marek's comment, below are the results of comparing the built-in functions "lm" and "lm.fit", my own function, and "fastLm" and "fastLmPure" from the RcppArmadillo package:

> system.time( res1 <- apply(x, 1, function(xi){lm(y ~ xi)}) )
   user  system elapsed
  2.859   0.005   2.865
> system.time( res2 <- apply(x, 1, function(xi){apply(y, 2, GetResFromCustomLinReg, xi)}) )
   user  system elapsed
  4.620   0.004   4.626
> system.time( res3 <- apply(x, 1, function(xi){lm.fit(x=cbind(1,xi), y=y)}) )
   user  system elapsed
  0.454   0.004   0.458
> system.time( res4 <- apply(x, 1, function(xi){apply(y, 2, fastLm, x=cbind(1,xi))}) )
   user  system elapsed
  2.279   0.005   2.283
> system.time( res5 <- apply(x, 1, function(xi){apply(y, 2, fastLmPure, cbind(1,xi))}) )
   user  system elapsed
  1.053   0.003   1.056

However, be careful when comparing these numbers. The differences come not only from the different implementations, but also from which results are actually computed:

> names(res1$p1)
 [1] "coefficients"  "residuals"     "effects"       "rank"        
 [5] "fitted.values" "assign"        "qr"            "df.residual" 
 [9] "xlevels"       "call"          "terms"         "model"       
> # res2 (from my own custom function) returns the estimate of beta, its standard error, the estimate of sigma^2, and the R^2
> names(res3$p1)
[1] "coefficients"  "residuals"     "effects"       "rank"        
[5] "fitted.values" "assign"        "qr"            "df.residual" 
> names(res4$p1$out1)
[1] "coefficients"  "stderr"        "df"            "fitted.values"
[5] "residuals"     "call"        
> names(res5$p1$out1)
[1] "coefficients" "stderr"       "df"         

For instance, we may prefer "lm.fit" over "lm", but if we need R^2 we will have to compute it ourselves. Likewise, we may want to use "fastLm" instead of "lm", but if we want the estimate of sigma^2 we will have to compute it ourselves. And computing such things with a custom R function may not be very efficient (compared to what "lm" does internally).
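To make that concrete, here is a small sketch of how one could recover sigma^2 and R^2 from what "lm.fit" already returns (my own helper with a hypothetical name, not benchmarked):

GetExtrasFromLmFit <- function(fit, y){  # fit: result of lm.fit(); y: the response matrix
  rss <- colSums(as.matrix(fit$residuals)^2)           # residual sum of squares per output
  sigma2.hat <- rss / fit$df.residual                  # residual variance estimates
  yc <- scale(as.matrix(y), center=TRUE, scale=FALSE)  # center each output
  R2 <- 1 - rss / colSums(yc^2)                        # coefficients of determination
  list(sigma2=sigma2.hat, R2=R2)
}
# example: extras <- GetExtrasFromLmFit(lm.fit(x=cbind(1, x[1,]), y=y), y)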

In light of all this, I will keep using "lm" for the moment, but Dirk's advice about "fastLm" is indeed good (that's why I accepted his answer, as it should be of interest to other people).
