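I am fitting a model to factor data and making predictions. If the newdata passed to predict.lm() contains a single factor level that is unknown to the model, the whole predict.lm() call fails and returns an error.

Is there a good way to have predict.lm() return predictions for the factor levels the model knows and NA for unknown factor levels, instead of just an error?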


foo <- data.frame(response=rnorm(3),predictor=as.factor(c("A","B","C")))
model <- lm(response~predictor,foo)
foo.new <- data.frame(predictor=as.factor(c("A","B","C","D")))
predict(model,newdata=foo.new)

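I would like the very last command to return three "real" predictions corresponding to factor levels "A", "B" and "C", and an NA corresponding to the unknown level "D".

> id <- which(!(foo.new$predictor %in% levels(foo$predictor)))
> foo.new$predictor[id] <- NA
> predict(model,newdata=foo.new)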
         1          2          3          4 
-0.1676941 -0.6454521  0.4524391         NA 

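This is a more general way of doing it: it sets all levels that do not occur in the original data to NA. As Hadley mentioned in the comments, they could have chosen to include this in the predict() function, but they didn't.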
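A compact variant of the same idea, as a sketch only: re-code each factor column of the new data against the levels stored in the fitted model (`model$xlevels`), so anything the model has not seen becomes NA. The helper name `na_unseen_levels` is made up for this illustration, and it assumes the predictors enter the formula as plain factor columns (no `as.factor()` wrappers in the formula).

```r
# Hypothetical helper (not part of base R): re-code factor columns against
# the levels stored in the fit, so unseen levels turn into NA.
na_unseen_levels <- function(model, newdata) {
  for (v in intersect(names(model$xlevels), names(newdata))) {
    newdata[[v]] <- factor(as.character(newdata[[v]]),
                           levels = model$xlevels[[v]])
  }
  newdata
}

set.seed(1)
foo <- data.frame(response = rnorm(3),
                  predictor = as.factor(c("A", "B", "C")))
model <- lm(response ~ predictor, foo)
foo.new <- data.frame(predictor = as.factor(c("A", "B", "C", "D")))

pred <- predict(model, newdata = na_unseen_levels(model, foo.new))
pred  # three fitted values, then NA for the unseen level "D"
```

This relies on predict.lm() passing NA rows through (its default `na.action = na.pass`), which is also what the manual approach above depends on.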


model.matrix(~predictor,data=foo) %*% coef(model)
1 -0.1676941
2 -0.6454521
3  0.4524391

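Below are both model matrices. You can see that the one for foo.new has an extra column, so you can no longer use the matrix calculation. If you used the new dataset to fit the model, you would also get a different model, one with an extra dummy variable for the extra level.

> model.matrix(~predictor,data=foo)
  (Intercept) predictorB predictorC
1           1          0          0
2           1          1          0
3           1          0          1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$predictor
[1] "contr.treatment"

> model.matrix(~predictor,data=foo.new)
  (Intercept) predictorB predictorC predictorD
1           1          0          0          0
2           1          1          0          0
3           1          0          1          0
4           1          0          0          1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$predictor
[1] "contr.treatment"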

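You can't just delete the last column from the model matrix either, because even if you do that, the other levels are still affected. The code for level A is (0,0). For B it is (1,0), for C it is (0,1) ... and for D it is again (0,0)! So if your model naively dropped the last dummy variable, it would assume that A and D are the same level.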
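To make the point about the dummy coding concrete, here is a small sketch built on the objects from the question. The new data frame is called `foo.new` here, and `X.new`, `X.dropped` and `naive` are illustrative names only. It drops the extra column by hand and shows that "D" then receives exactly the prediction of the reference level "A".

```r
set.seed(1)
foo <- data.frame(response = rnorm(3),
                  predictor = as.factor(c("A", "B", "C")))
model <- lm(response ~ predictor, foo)
foo.new <- data.frame(predictor = as.factor(c("A", "B", "C", "D")))

X.new <- model.matrix(~predictor, data = foo.new)
ncol(X.new)          # 4 columns, but coef(model) has only 3 entries

# Naively keep only the columns the model knows about
X.dropped <- X.new[, names(coef(model)), drop = FALSE]
naive <- drop(X.dropped %*% coef(model))

# Row 4 ("D") is now coded (1, 0, 0), identical to row 1 ("A")
naive[4] == naive[1]
```

So the calculation runs, but the number it returns for "D" is just the prediction for "A", which is why setting the unknown level to NA is the safer choice.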


On a more theoretical part: It is possible to build a model without having all the levels. Now, as I tried to explain before, that model is only valid for the levels you used when building the model. If you come across new levels, you have to build a new model to include the extra information. If you don't do that, the only thing you can do is delete the extra levels from the dataset. But then you basically lose all information that was contained in it, so it's generally not considered good practice.

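If you want to deal with the missing levels in your data after creating your lm model but before calling predict (given that we don't know beforehand exactly which levels might be missing), here is a function I've built that sets all levels not in the model to NA. The prediction will then also give NA, and you can use an alternative method to predict those values.

object will be your lm output from lm(...,data=trainData)

data will be the data frame you want to create predictions for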


missingLevelsToNA<-function(object,data){

  #Obtain factor predictors in the model and their levels ------------------

  factors<-(gsub("[-^0-9]|as.factor|\\(|\\)", "",names(unlist(object$xlevels))))
  factorLevels<-unname(unlist(object$xlevels))
  modelFactors<-as.data.frame(cbind(factors,factorLevels))

  #Select column names in your data that are factor predictors in your model -----

  predictors<-names(data[names(data) %in% factors])

  #For each factor predictor in your data if the level is not in the model set the value to NA --------------

  for (i in seq_along(predictors)){
    found<-data[,predictors[i]] %in% modelFactors[modelFactors$factors==predictors[i],]$factorLevels
    if (any(!found)) data[!found,predictors[i]]<-NA
  }

  #Return the data with unknown levels replaced by NA -----

  data

}

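Here is a tidied and extended version of MorgenBall's function. It is now also implemented in sperrorest.

Additional features:

  • drops unused factor levels rather than just setting the missing values to NA
  • issues a message to the user that factor levels have been dropped
  • checks whether factor variables exist in test_data and returns the original data.frame if none are present
  • works not only for lm and glm but also for glmmPQL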


#' @title remove_missing_levels
#' @description Accounts for missing factor levels present only in test data
#' but not in train data by setting values to NA
#' @import magrittr
#' @importFrom gdata unmatrix
#' @importFrom stringr str_split
#' @importFrom purrr map
#' @param fit fitted model on training data
#' @param test_data data to make predictions for
#' @return data.frame with matching factor levels to fitted model
#' @keywords internal
#' @export
remove_missing_levels <- function(fit, test_data) {

  # drop empty factor levels in test data
  test_data %>%
    droplevels() %>%
    as.data.frame() -> test_data

  # 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to
  # account for it
  if (any(class(fit) == "glmmPQL")) {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$contrasts))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }

    map(fit$contrasts, function(x) names(unmatrix(x))) %>%
      unlist() -> factor_levels
    factor_levels %>% str_split(":", simplify = TRUE) %>%
      extract(, 1) -> factor_levels

    model_factors <- as.data.frame(cbind(factors, factor_levels))
  } else {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$xlevels))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }

    factor_levels <- unname(unlist(fit$xlevels))
    model_factors <- as.data.frame(cbind(factors, factor_levels))
  }

  # Select column names in test data that are factor predictors in
  # trained model

  predictors <- names(test_data[names(test_data) %in% factors])

  # For each factor predictor in your data, if the level is not in the model,
  # set the value to NA

  for (i in seq_along(predictors)) {
    found <- test_data[, predictors[i]] %in% model_factors[
      model_factors$factors == predictors[i], ]$factor_levels
    if (any(!found)) {
      # track which variable
      var <- predictors[i]
      # set to NA
      test_data[!found, predictors[i]] <- NA
      # drop empty factor levels in test data
      test_data %>%
        droplevels() -> test_data
      # issue warning to console
      message(sprintf(paste0("Setting missing levels in '%s', only present",
                             " in test data but missing in train data,",
                             " to 'NA'."),
                      var))
    }
  }
  return(test_data)
}

predict(model, newdata = remove_missing_levels(fit = model, test_data = foo.new))

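While trying to improve this function, I found that SL learning methods like lm, glm etc. need the same levels in train and test, whereas ML learning methods (svm, randomForest) fail if the levels are removed. Those methods need all levels in both train and test.

A general solution is quite hard to achieve, because every fitted model stores its factor level component in a different way (fit$xlevels for lm and fit$contrasts for glmmPQL). At least it seems to be consistent across lm-related models.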
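For reference, a quick way to see where lm and glm keep this information. This is a minimal sketch with made-up data; the data frame `d` and its columns `y`, `z` and `g` are assumptions for illustration only.

```r
set.seed(1)
d <- data.frame(y = rnorm(6),
                z = rep(c(0, 1), 3),
                g = factor(rep(c("A", "B", "C"), each = 2)))

fit_lm  <- lm(y ~ g, data = d)
fit_glm <- glm(z ~ g, data = d, family = binomial)

fit_lm$xlevels    # named list: one character vector of levels per factor
fit_glm$xlevels   # glm objects carry the same component
```

Other model classes may store the levels elsewhere or not at all, so it is worth inspecting the fitted object with str() before relying on a particular component.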

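Sounds like you might like random effects. Look into something like glmer (lme4 package). With a Bayesian model, you'll get effects that approach 0 when there is little information to use when estimating them. Be warned, though, that you'll have to do the prediction yourself rather than using predict().

Alternatively, you can simply make dummy variables for the levels you want to include in the model, e.g. a 0/1 variable for Monday, one for Tuesday, one for Wednesday, etc. Sunday will automatically be removed from the model if its column contains all 0's. But having a 1 in the Sunday column in the other data won't make the prediction step fail. It will just assume that Sunday has an effect that is the average of the other days (which may or may not be true).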
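A minimal sketch of the dummy-variable idea, under stated assumptions: the data are simulated, the helper `add_dummies` and the column names `is_Mon` / `is_Tue` are hypothetical, and only two days get their own indicator. In this particular coding (with an intercept), every day without an indicator, including one never seen in training, is predicted at the intercept, which is the pooled mean of the training days that have no indicator.

```r
set.seed(1)
train <- data.frame(dow = rep(c("Mon", "Tue", "Wed", "Thu"), each = 5))
train$y <- rnorm(nrow(train))

# Hypothetical helper: 0/1 indicator columns for the chosen days only
add_dummies <- function(df, days = c("Mon", "Tue")) {
  for (d in days) df[[paste0("is_", d)]] <- as.integer(df$dow == d)
  df
}

fit <- lm(y ~ is_Mon + is_Tue, data = add_dummies(train))

# "Sun" never appeared in the training data
new_days <- add_dummies(data.frame(dow = c("Mon", "Sun")))
p <- predict(fit, newdata = new_days)
p  # no error; the "Sun" row is predicted at the intercept
```

Whether treating an unseen day like the pooled baseline is reasonable depends on the data, so this is a way to keep predict() running rather than a substitute for refitting with the new level.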

One of the assumptions of linear/logistic regression is little or no multicollinearity; so if the predictor variables are ideally independent of each other, the model does not need to see every possible factor level. A new factor level (D) is a new predictor, and can be set to NA without affecting the predictive ability of the remaining levels A, B, C. This is why the model should still be able to make predictions. But the addition of the new level D throws off the expected schema. That's the whole issue. Setting it to NA fixes that.

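The lme4 package will handle new levels if you set the flag allow.new.levels=TRUE when calling predict.

Example: if you have your day-of-week factor in a variable dow and a categorical outcome b_fail, you could run

# df.your.data (training data) and df.new.data (data to predict for) are placeholder names
# glmer() is used because current lme4 fits models with a family argument via glmer, not lmer
M0 <- glmer(b_fail ~ x + (1 | dow), data=df.your.data, family=binomial(link='logit'))
M0.preds <- predict(M0, df.new.data, allow.new.levels=TRUE)

This is an example with a random-effects logistic regression. Of course, you can perform regular regression ... or most GLM models. If you want to head further down the Bayesian path, look at Gelman & Hill's excellent book and the Stan infrastructure.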

rare_to_other <- function(x, fault_factor = 1e6) {
  # dirty dealing with rare levels:
  # recode small cells as "other" before splitting to train/test,
  # assuring that lopsided split occurs with prob < 1/fault_factor
  # (N.b. not fully kosher, but useful for quick and dirty exploratory).

  if (is.factor(x) || is.character(x)) {
    min.cell.size <- log(fault_factor, 2) + 1
    xfreq <- sort(table(x), decreasing = TRUE)
    rare_levels <- names(which(xfreq < min.cell.size))
    if (length(rare_levels) == length(unique(x))) {
      warning("all levels are rare and recorded as other. make sure this is desirable")
    }
    if (length(rare_levels) > 0) {
      message("recoding rare levels")
      if (is.factor(x)) {
        altx <- as.character(x)
        altx[altx %in% rare_levels] <- "other"
        x <- as.factor(altx)
        return(x)
      } else {
        # is.character(x)
        x[x %in% rare_levels] <- "other"
        return(x)
      }
    } else {
      message("no rare levels encountered")
      return(x)
    }
  } else {
    message("x is neither a factor nor a character, doing nothing")
    return(x)
  }
}

For example, with data.table, the call would be something like:

dt[, (xcols) := mclapply(.SD, rare_to_other), .SDcols = xcols] # recode rare levels as other

where xcols is any subset of colnames(dt).

