Why does lm return values when there is no variance in the predicted value?
Consider the following R code (which, I think, eventually calls some Fortran):
X <- 1:1000
Y <- rep(1,1000)
summary(lm(Y~X))
Why are values returned by summary? Shouldn't this model fail to fit since there is no variance in Y? More importantly, why is the model R^2 ~= .5?
Edit
I tracked the code from lm to lm.fit and can see this call:
z <- .Fortran("dqrls", qr = x, n = n, p = p, y = y, ny = ny,
tol = as.double(tol), coefficients = mat.or.vec(p, ny), residuals = y,
effects = y, rank = integer(1L), pivot = 1L:p, qraux = double(p),
work = double(2 * p), PACKAGE = "base")
That is where the actual fit seems to happen. Looking at dqrls.f (http://svn.r-project.org/R/trunk/src/appl/dqrls.f) did not help me understand what is going on, because I do not know Fortran.
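For readers who, like the asker, do not read Fortran: `dqrls` solves the least-squares problem through a Householder QR decomposition of the design matrix. A rough R-level sketch of the same idea follows, assuming base R's `qr()` and `qr.coef()` as stand-ins for the Fortran call. It is an illustration, not the code path `lm.fit` literally executes, and the name `design` is chosen here for the example.

```r
X <- 1:1000
Y <- rep(1, 1000)

# Design matrix that lm(Y ~ X) builds: an intercept column plus X
design <- cbind(Intercept = 1, X = X)

# QR-decompose the design matrix, then solve for the coefficients
decomp <- qr(design)
beta <- qr.coef(decomp, Y)

decomp$rank   # numerical rank of the design matrix
beta          # intercept close to 1, slope close to 0
```

Nothing in this computation requires Y to vary; only the columns of the design matrix need to be linearly independent.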
Comments (3)
Statistically speaking, what should we anticipate (I'd like to say "expect", but that's a very specific term ;-))? The coefficients should be 1 for the intercept and 0 for X, rather than a failure to fit. The slope estimate is the covariance of (X, Y) divided by the variance of X, not the other way around. As X has non-zero variance, there is no problem. As the covariance is 0, the estimated coefficient for X should be 0. So, within machine tolerance, this is the answer you're getting.
There is no statistical anomaly here. There may be a statistical misunderstanding. There's also the issue of machine tolerance, but a coefficient on the order of 1E-19 is rather negligible, given the scale of the predictor and response values.
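The same tolerance argument plausibly accounts for the R^2 near .5 in the question. `summary.lm` reports R^2 as the model sum of squares divided by the sum of the model and residual sums of squares; here both are rounding noise, so the ratio is noise as well. A minimal sketch (the names `rss` and `mss` are just labels used for this illustration):

```r
X <- 1:1000
Y <- rep(1, 1000)
fit <- lm(Y ~ X)

# Residual and model sums of squares for a model with an intercept
rss <- sum(residuals(fit)^2)
mss <- sum((fitted(fit) - mean(fitted(fit)))^2)

c(rss = rss, mss = mss)   # both sit at the level of rounding error
mss / (mss + rss)         # a ratio of two near-zero numbers, so not meaningful
```

The exact value of the ratio may differ between platforms, since it depends only on how the rounding errors happen to fall.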
Update 1: A quick review of simple linear regression can be found on Wikipedia. The key thing to note is that `Var(x)` is in the denominator and `Cov(x,y)` is in the numerator. In this case, the numerator is 0 and the denominator is non-zero, so there is no reason to expect a `NaN` or `NA`. However, one may ask why the resulting coefficient for `x` isn't exactly `0`; that has to do with the numerical precision of the QR decomposition.
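The closed-form estimates make this concrete. A short check, assuming nothing beyond base R's `cov`, `var`, and `mean` (the variable names `slope` and `intercept` are chosen for this example):

```r
X <- 1:1000
Y <- rep(1, 1000)

# Simple linear regression in closed form
slope <- cov(X, Y) / var(X)              # numerator is 0, denominator is not
intercept <- mean(Y) - slope * mean(X)

c(intercept = intercept, slope = slope)

# The QR-based fit agrees up to rounding error
coef(lm(Y ~ X))
```

The closed form never divides by anything derived from the variance of Y, which is why a constant response causes no trouble.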
I believe this is simply because the QR decomposition is implemented with floating point arithmetic.
The `singular.ok` parameter actually refers to the design matrix (i.e. X only). Try vs.
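One way to see that `singular.ok` is about the design matrix rather than the response is to contrast a full-rank design with a rank-deficient one. The comparison below is an illustrative sketch, not necessarily the one the answer had in mind; duplicating the `X` column is an assumption made here to force rank deficiency.

```r
X <- 1:1000
Y <- rep(1, 1000)

# Full-rank design: a constant Y is no obstacle
full <- lm.fit(cbind(1, X), Y)
full$rank

# Rank-deficient design: X appears twice, so one column is aliased
dup <- lm.fit(cbind(1, X, X), Y)   # singular.ok = TRUE is the default
dup$rank
dup$coefficients                   # the aliased column gets NA

# With singular.ok = FALSE the rank-deficient design is rejected
msg <- tryCatch(
  lm.fit(cbind(1, X, X), Y, singular.ok = FALSE),
  error = function(e) conditionMessage(e)
)
msg
```

In both designs Y is the same constant vector; only the change to the design matrix triggers the singular-fit handling.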
I agree that the problem might be floating point, but I don't think it is singularity.
If you check using
solve(t(x1)%*%x1)%*%(t(x1)%*%Y)
instead of QR, `(t(x1)%*%x1)` is not singular. Use
x1 = cbind(rep(1,1000),X)
because `lm(Y~X)` includes the intercept.
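A runnable version of this check, as a sketch. `rcond` is used here as one possible way to confirm that the cross-product matrix is not numerically singular; that choice, and the names `xtx` and `beta`, are assumptions of this example rather than part of the answer.

```r
X <- 1:1000
Y <- rep(1, 1000)

# Design matrix with an explicit intercept column
x1 <- cbind(rep(1, 1000), X)
xtx <- t(x1) %*% x1

# Reciprocal condition number: far above machine epsilon, so not singular
rcond(xtx)

# Normal-equations solution, without QR
beta <- solve(xtx) %*% (t(x1) %*% Y)
beta   # intercept close to 1, slope close to 0
```

Forming the cross-product squares the condition number of the problem, which is one reason `lm` works from a QR decomposition instead.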