How can I use the columns of a matrix column as predictors in a linear regression in R?
Problem Statement: Some near infrared spectra on 60 samples of gasoline and corresponding octane numbers can be found by data(gasoline, package="pls").
Compute the mean value for each frequency and predict the response for the best model using the five different methods from Question 4.
Note: This is Exercise 11.5 in Linear Models with R, 2nd Ed., by Julian Faraway. Also, the "five different methods from Question 4" are: linear regression with all predictors, linear regression with variables selected using AIC, principal component regression, partial least squares, and ridge regression.
My Work So Far: We do
require(pls)
data(gasoline, package="pls")
## hold out every 10th observation as a test set
test_index = seq(1,nrow(gasoline),10)
train_index = 1:nrow(gasoline)
train_index = train_index[!train_index %in% test_index]
train_gas = gasoline[train_index,]
test_gas = gasoline[test_index,]
## regress octane on the 401-column NIR matrix held in gasoline$NIR
lmod = lm(octane~NIR,train_gas)
So far, so good. However, if I look at a summary of the model, I find that 348 coefficients are not defined because of singularities. (Why?) Moreover, massaging the mean values of the columns of the NIR
matrix (the predictors) into an acceptable data frame is proving difficult.
My Question: How can I get to the point where the highly-fussy predict
function will let me do something like this:
new_data = apply(train_gas$NIR, 2, mean)
*some code here*
predict(lmod, new_data)
?
Incidentally, as I have done a significant amount of moderating on Stats.SE, I can assert positively that this question would be closed on Stats.SE as being off-topic. It's a "programming or data request", and hence unwelcome on Stats.SE.
I have also looked up a few related questions on SO, but nothing seems to fit exactly.
Answer:
This does seem pretty CrossValidated-ish to me ...
gasoline is a rather odd object, containing a 'column' (element) that is a 401-column matrix.

However, the fundamental problem is that this is a p >> n problem; there are 60 observations and 401 predictors. Thus, a standard linear regression probably just won't make sense - you probably want to use a penalized approach like LASSO/ridge (i.e., glmnet). This is why you get the undefined coefficients: without some kind of penalization, you can't estimate 402 coefficients (ncol + 1 for the intercept) from 60 observations ...

However, if we do want to hack this into a shape where we can do the linear model and prediction (however ill-advised):
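Something along these lines should work; it flattens the matrix column into ordinary numeric columns and is only a sketch (train_flat, lmod2 and new_flat are illustrative names):

## peek at the odd structure first
dim(gasoline$NIR)   # 60 401: one 'column' that is itself a 401-column matrix
## flatten the matrix column into an ordinary data frame, one column per frequency
train_flat = data.frame(octane = train_gas$octane,
                        as.data.frame(unclass(train_gas$NIR)))
lmod2 = lm(octane ~ ., data = train_flat)
## a one-row data frame of the column means, built the same way so the names match
new_flat = as.data.frame(as.list(colMeans(train_gas$NIR)))
predict(lmod2, newdata = new_flat)   # warns about the rank-deficient fit, but returns a value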
A slightly more direct approach (but still ugly) is to fit the model to the original weird structure and construct a prediction frame that matches that weird structure, i.e.
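A sketch of that (new_weird is an illustrative name; the I() wrapper keeps data.frame() from splitting the matrix into 401 separate columns):

## a one-row data frame whose NIR element is itself a 1 x 401 matrix,
## mirroring the shape of the training data
new_weird = data.frame(NIR = I(t(colMeans(train_gas$NIR))))
predict(lmod, newdata = new_weird)   # same rank-deficiency warning as above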
If you were willing to forgo predict() you could do it like this:
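For example, a sketch that relies on the fact that the coefficients lm() could not estimate are NA, so na.rm = TRUE simply drops them:

## intercept plus the 401 column means, multiplied elementwise by the fitted coefficients
sum(coef(lmod) * c(1, colMeans(train_gas$NIR)), na.rm = TRUE)

And, for the penalized route suggested above, a minimal ridge-style sketch with cv.glmnet (assuming the glmnet package is installed; cvfit is an illustrative name):

library(glmnet)
x = unclass(train_gas$NIR)            # plain 54 x 401 predictor matrix
y = train_gas$octane
cvfit = cv.glmnet(x, y, alpha = 0)    # alpha = 0 gives ridge; lambda chosen by cross-validation
predict(cvfit, newx = t(colMeans(x)), s = "lambda.min")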