Predict.lm() with an unknown factor level in test data
I am fitting a model to factor data and predicting. If the newdata in predict.lm() contains a single factor level that is unknown to the model, all of predict.lm() fails and returns an error.
Is there a good way to have predict.lm() return a prediction for those factor levels the model knows and NA for unknown factor levels, instead of only an error?
Example code:
foo <- data.frame(response=rnorm(3),predictor=as.factor(c("A","B","C")))
model <- lm(response~predictor,foo)
foo.new <- data.frame(predictor=as.factor(c("A","B","C","D")))
predict(model,newdata=foo.new)
I would like the very last command to return three "real" predictions corresponding to factor levels "A", "B" and "C" and an NA corresponding to the unknown level "D".
7 Answers
You have to remove the extra levels before any calculation, like:
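For the example in the question, a minimal sketch of that approach (the helper variable id is just illustrative): set every level of foo.new$predictor that the model has never seen to NA before predicting.

id <- which(!(foo.new$predictor %in% levels(foo$predictor)))
foo.new$predictor[id] <- NA
predict(model, newdata = foo.new)
# predictions for "A", "B", "C"; NA for the unseen level "D"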
This is a more general way of doing it: it will set all levels that do not occur in the original data to NA. As Hadley mentioned in the comments, they could have chosen to include this in the predict() function, but they didn't.
Why you have to do that becomes obvious if you look at the calculation itself. Internally, the predictions are calculated as:
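A sketch of that calculation for the example data (the exact numbers depend on the rnorm() draw):

model.matrix(~ predictor, data = foo) %*% coef(model)
# reproduces the predictions for the training levels

model.matrix(~ predictor, data = foo)
model.matrix(~ predictor, data = foo.new)
# the matrix built from foo.new has an extra dummy column for level "D",
# so it no longer conforms to coef(model)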
At the bottom you have both model matrices. You see that the one for foo.new has an extra column, so you can't use the matrix calculation any more. If you used the new dataset for modelling, you would also get a different model, namely one with an extra dummy variable for the extra level.
You can't just delete the last column from the model matrix either, because even if you do that, both other levels are still influenced. The code for level A would be (0,0). For B it is (1,0), for C it is (0,1) ... and for D it is again (0,0)! So your model would assume that A and D are the same level if it naively dropped the last dummy variable.
On a more theoretical note: it is possible to build a model without having all the levels. Now, as I tried to explain before, that model is only valid for the levels you used when building it. If you come across new levels, you have to build a new model to include the extra information. If you don't do that, the only thing you can do is delete the extra levels from the dataset. But then you basically lose all information that was contained in them, so it's generally not considered good practice.
If you want to deal with the missing levels in your data after creating your lm model but before calling predict (given we don't know exactly which levels might be missing beforehand), here is a function I've built to set all levels not in the model to NA. The prediction will then also give NA, and you can use an alternative method to predict those values.
object will be your lm output from lm(...,data=trainData)
data will be the data frame you want to create predictions for
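A minimal sketch of such a helper, built on the xlevels component that lm stores (the name missingLevelsToNA and its details are illustrative, not the original author's exact code):

missingLevelsToNA <- function(object, data) {
  # object$xlevels lists, for every factor in the lm fit, the levels
  # that were present in the training data
  for (var in intersect(names(object$xlevels), names(data))) {
    known <- object$xlevels[[var]]
    vals  <- as.character(data[[var]])
    vals[!(vals %in% known)] <- NA        # unseen levels become NA
    data[[var]] <- factor(vals, levels = known)
  }
  data
}

predict(model, newdata = missingLevelsToNA(model, foo.new))
# rows with an unseen level get NA predictions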
Tidied and extended the function by MorgenBall. It is also implemented in sperrorest now.
Additional features:
- drops unused factor levels in addition to setting missing values to NA
- checks for the existence of factor variables in test_data and returns the original data.frame if none are present
- works not only for lm and glm but also for glmmPQL
Note: The function shown here may change (improve) over time.
We can apply this function to the example in the question as follows:
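A sketch of that call, assuming the tidied helper keeps the sperrorest name and signature remove_missing_levels(fit, test_data) (both are assumptions here; a helper like the one sketched in the previous answer would behave the same way):

predict(model, newdata = remove_missing_levels(fit = model, test_data = foo.new))
# predictions for "A", "B", "C"; NA for the unseen level "D"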
While trying to improve this function, I came across the fact that SL learning methods like lm, glm etc. need the same levels in train & test, while ML learning methods (svm, randomForest) fail if the levels are removed. These methods need all levels in train & test.
A general solution is quite hard to achieve since every fitted model has a different way of storing its factor level component (fit$xlevels for lm and fit$contrasts for glmmPQL). At least it seems to be consistent across lm related models.
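For instance, the lm fit from the question stores its training levels like this (a small illustrative check):

model$xlevels
# $predictor
# [1] "A" "B" "C"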
Sounds like you might like random effects. Look into something like glmer (lme4 package). With a Bayesian model, you'll get effects that approach 0 when there's little information to use when estimating them. Warning, though, that you'll have to do prediction yourself, rather than using predict().
Alternatively, you can simply make dummy variables for the levels you want to include in the model, e.g. a 0/1 variable for Monday, one for Tuesday, one for Wednesday, etc. Sunday will be automatically removed from the model if it contains all 0's. But having a 1 in the Sunday column in the other data won't make the prediction step fail. It will just assume that Sunday has an effect that's the average of the other days (which may or may not be true).
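A small sketch of that idea with the question's data (the dummy column names predB and predC are made up); the unseen level "D" simply gets all-zero dummies, so predict() runs without error and treats "D" like the baseline level "A":

foo$predB <- as.numeric(foo$predictor == "B")
foo$predC <- as.numeric(foo$predictor == "C")
model2 <- lm(response ~ predB + predC, data = foo)

foo.new$predB <- as.numeric(foo.new$predictor == "B")
foo.new$predC <- as.numeric(foo.new$predictor == "C")
predict(model2, newdata = foo.new)  # no error; "D" is predicted like the baseline level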
One of the assumptions of linear/logistic regression is little or no multi-collinearity; so if the predictor variables are ideally independent of each other, then the model does not need to see all the possible variety of factor levels. A new factor level (D) is a new predictor and can be set to NA without affecting the predicting ability of the remaining factors A, B, C. This is why the model should still be able to make predictions. But the addition of the new level D throws off the expected schema. That's the whole issue. Setting it to NA fixes that.
The lme4 package will handle new levels if you set the flag allow.new.levels=TRUE when calling predict.
Example: if your day-of-week factor is in a variable dow and a categorical outcome b_fail, you could run

M0 <- glmer(b_fail ~ x + (1 | dow), data = df.your.data, family = binomial(link = 'logit'))
M0.preds <- predict(M0, df.new.data, allow.new.levels=TRUE)
This is an example with a random effects logistic regression. Of course, you can perform regular regression ... or most GLM models. If you want to head further down the Bayesian path, look at Gelman & Hill's excellent book and the Stan infrastructure.
A quick-and-dirty solution for split testing is to recode rare values as "other". Here is an implementation:
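A minimal sketch of such a recoder (the function name keep_top_levels and the min_count threshold are illustrative, not the original answer's exact code):

keep_top_levels <- function(x, min_count = 5, other = "other") {
  x <- as.character(x)
  counts <- table(x)
  rare <- names(counts)[counts < min_count]  # levels seen fewer than min_count times
  x[x %in% rare] <- other                    # collapse rare levels into one bucket
  factor(x)
}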
For example, with data.table, the call would be something like:
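A sketch of that call, reusing the helper sketched above (dt is assumed to be a data.table):

library(data.table)
dt[, (xcols) := lapply(.SD, keep_top_levels), .SDcols = xcols]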
where xcols is any subset of colnames(dt).