测试数据中因子水平未知的 Predict.lm()

发布于 2024-10-04 17:39:18 字数 556 浏览 9 评论 0原文

我正在拟合一个模型来分解数据并进行预测。如果 predict.lm() 中的 newdata 包含模型未知的单个因子水平,则 predict.lm 的全部 () 失败并返回错误。

有没有一种好方法可以让 predict.lm() 返回模型已知的因子水平的预测以及未知因子水平的 NA,而不仅仅是错误?

示例代码:

foo <- data.frame(response=rnorm(3),predictor=as.factor(c("A","B","C")))
model <- lm(response~predictor,foo)
foo.new <- data.frame(predictor=as.factor(c("A","B","C","D")))
predict(model,newdata=foo.new)

我希望最后一个命令返回对应于因子级别“A”、“B”和“C”的三个“真实”预测以及对应于未知级别“D”的 NA

I am fitting a model to factor data and predicting. If the newdata in predict.lm() contains a single factor level that is unknown to the model, all of predict.lm() fails and returns an error.

Is there a good way to have predict.lm() return a prediction for those factor levels the model knows and NA for unknown factor levels, instead of only an error?

Example code:

foo <- data.frame(response=rnorm(3),predictor=as.factor(c("A","B","C")))
model <- lm(response~predictor,foo)
foo.new <- data.frame(predictor=as.factor(c("A","B","C","D")))
predict(model,newdata=foo.new)

I would like the very last command to return three "real" predictions corresponding to factor levels "A", "B" and "C" and an NA corresponding to the unknown level "D".

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(7

如此安好 2024-10-11 17:39:18

您必须在进行任何计算之前删除额外的级别,例如:

> id <- which(!(foo.new$predictor %in% levels(foo$predictor)))
> foo.new$predictor[id] <- NA
> predict(model,newdata=foo.new)
         1          2          3          4 
-0.1676941 -0.6454521  0.4524391         NA 

这是一种更通用的方法,它将把原始数据中未出现的所有级别设置为 NA。正如 Hadley 在评论中提到的,他们本可以选择将其包含在 predict() 函数中,但他们没有。

如果您查看计算本身,为什么必须这样做就变得显而易见。在内部,预测计算如下:

model.matrix(~predictor,data=foo) %*% coef(model)
        [,1]
1 -0.1676941
2 -0.6454521
3  0.4524391

在底部有两个模型矩阵。您会看到 foo.new 的那个有一个额外的列,因此您不能再使用矩阵计算。如果您使用新的数据集进行建模,您还将获得一个不同的模型,该模型在额外级别上具有额外的虚拟变量。

> model.matrix(~predictor,data=foo)
  (Intercept) predictorB predictorC
1           1          0          0
2           1          1          0
3           1          0          1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$predictor
[1] "contr.treatment"

> model.matrix(~predictor,data=foo.new)
  (Intercept) predictorB predictorC predictorD
1           1          0          0          0
2           1          1          0          0
3           1          0          1          0
4           1          0          0          1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$predictor
[1] "contr.treatment"

您也不能只删除模型矩阵中的最后一列,因为即使您这样做,其他两个级别仍然会受到影响。级别 A 的代码为 (0,0)。对于 B 来说是 (1,0),对于 C 来说是 (0,1) ...对于 D 来说又是 (0 ,0)!因此,如果您的模型天真地删除最后一个虚拟变量,则它会假设 AD 是同一级别。

在更理论的部分:可以在没有所有级别的情况下构建模型。现在,正如我之前尝试解释的那样,该模型对于您在构建模型时使用的级别有效。如果你遇到新的关卡,你必须构建一个新的模型来包含额外的信息。如果您不这样做,您唯一能做的就是从数据集中删除额外的级别。但随后您基本上会丢失其中包含的所有信息,因此通常不被认为是好的做法。

You have to remove the extra levels before any calculation, like:

> id <- which(!(foo.new$predictor %in% levels(foo$predictor)))
> foo.new$predictor[id] <- NA
> predict(model,newdata=foo.new)
         1          2          3          4 
-0.1676941 -0.6454521  0.4524391         NA 

This is a more general way of doing it, it will set all levels that do not occur in the original data to NA. As Hadley mentioned in the comments, they could have chosen to include this in the predict() function, but they didn't

Why you have to do that becomes obvious if you look at the calculation itself. Internally, the predictions are calculated as :

model.matrix(~predictor,data=foo) %*% coef(model)
        [,1]
1 -0.1676941
2 -0.6454521
3  0.4524391

At the bottom you have both model matrices. You see that the one for foo.new has an extra column, so you can't use the matrix calculation any more. If you would use the new dataset to model, you would also get a different model, being one with an extra dummy variable for the extra level.

> model.matrix(~predictor,data=foo)
  (Intercept) predictorB predictorC
1           1          0          0
2           1          1          0
3           1          0          1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$predictor
[1] "contr.treatment"

> model.matrix(~predictor,data=foo.new)
  (Intercept) predictorB predictorC predictorD
1           1          0          0          0
2           1          1          0          0
3           1          0          1          0
4           1          0          0          1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$predictor
[1] "contr.treatment"

You can't just delete the last column from the model matrix either, because even if you do that, both other levels are still influenced. The code for level A would be (0,0). For B this is (1,0), for C this (0,1) ... and for D it is again (0,0)! So your model would assume that A and D are the same level if it would naively drop the last dummy variable.

On a more theoretical part: It is possible to build a model without having all the levels. Now, as I tried to explain before, that model is only valid for the levels you used when building the model. If you come across new levels, you have to build a new model to include the extra information. If you don't do that, the only thing you can do is delete the extra levels from the dataset. But then you basically lose all information that was contained in it, so it's generally not considered good practice.

南街九尾狐 2024-10-11 17:39:18

如果您想在创建 lm 模型之后但在调用预测之前处理数据中缺失的级别(假设我们事先不知道可能会丢失哪些级别),这里是我构建的函数,用于设置不在模型到 NA - 预测也将给出 NA,然后您可以使用替代方法来预测这些值。

object 将是 lm(...,data=trainData) 的 lm 输出

data 将是您要为其创建预测的数据框

missingLevelsToNA<-function(object,data){

  #Obtain factor predictors in the model and their levels ------------------

  factors<-(gsub("[-^0-9]|as.factor|\\(|\\)", "",names(unlist(object$xlevels))))
  factorLevels<-unname(unlist(object$xlevels))
  modelFactors<-as.data.frame(cbind(factors,factorLevels))


  #Select column names in your data that are factor predictors in your model -----

  predictors<-names(data[names(data) %in% factors])


  #For each factor predictor in your data if the level is not in the model set the value to NA --------------

  for (i in 1:length(predictors)){
    found<-data[,predictors[i]] %in% modelFactors[modelFactors$factors==predictors[i],]$factorLevels
    if (any(!found)) data[!found,predictors[i]]<-NA
  }

  data

}

If you want to deal with the missing levels in your data after creating your lm model but before calling predict (given we don't know exactly what levels might be missing beforehand) here is function I've built to set all levels not in the model to NA - the prediction will also then give NA and you can then use an alternative method to predict these values.

object will be your lm output from lm(...,data=trainData)

data will be the data frame you want to create predictions for

missingLevelsToNA<-function(object,data){

  #Obtain factor predictors in the model and their levels ------------------

  factors<-(gsub("[-^0-9]|as.factor|\\(|\\)", "",names(unlist(object$xlevels))))
  factorLevels<-unname(unlist(object$xlevels))
  modelFactors<-as.data.frame(cbind(factors,factorLevels))


  #Select column names in your data that are factor predictors in your model -----

  predictors<-names(data[names(data) %in% factors])


  #For each factor predictor in your data if the level is not in the model set the value to NA --------------

  for (i in 1:length(predictors)){
    found<-data[,predictors[i]] %in% modelFactors[modelFactors$factors==predictors[i],]$factorLevels
    if (any(!found)) data[!found,predictors[i]]<-NA
  }

  data

}
昔梦 2024-10-11 17:39:18

MorgenBall 整理并扩展了该功能。现在它也在 sperrorest 中实现。

其他功能

  • 会降低未使用的因子级别,而不仅仅是将缺失值设置为NA
  • 向用户发出一条消息,表明因子级别已被删除,
  • 检查 test_data 中是否存在因子变量,如果不存在,则返回原始 data.frame
  • 不仅适用于 lmglm 以及 glmmPQL

注意:此处显示的函数可能会随着时间的推移而改变(改进)。

#' @title remove_missing_levels
#' @description Accounts for missing factor levels present only in test data
#' but not in train data by setting values to NA
#'
#' @import magrittr
#' @importFrom gdata unmatrix
#' @importFrom stringr str_split
#'
#' @param fit fitted model on training data
#'
#' @param test_data data to make predictions for
#'
#' @return data.frame with matching factor levels to fitted model
#'
#' @keywords internal
#'
#' @export
remove_missing_levels <- function(fit, test_data) {

  # https://stackoverflow.com/a/39495480/4185785

  # drop empty factor levels in test data
  test_data %>%
    droplevels() %>%
    as.data.frame() -> test_data

  # 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to
  # account for it
  if (any(class(fit) == "glmmPQL")) {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$contrasts))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }

    map(fit$contrasts, function(x) names(unmatrix(x))) %>%
      unlist() -> factor_levels
    factor_levels %>% str_split(":", simplify = TRUE) %>%
      extract(, 1) -> factor_levels

    model_factors <- as.data.frame(cbind(factors, factor_levels))
  } else {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$xlevels))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }

    factor_levels <- unname(unlist(fit$xlevels))
    model_factors <- as.data.frame(cbind(factors, factor_levels))
  }

  # Select column names in test data that are factor predictors in
  # trained model

  predictors <- names(test_data[names(test_data) %in% factors])

  # For each factor predictor in your data, if the level is not in the model,
  # set the value to NA

  for (i in 1:length(predictors)) {
    found <- test_data[, predictors[i]] %in% model_factors[
      model_factors$factors == predictors[i], ]$factor_levels
    if (any(!found)) {
      # track which variable
      var <- predictors[i]
      # set to NA
      test_data[!found, predictors[i]] <- NA
      # drop empty factor levels in test data
      test_data %>%
        droplevels() -> test_data
      # issue warning to console
      message(sprintf(paste0("Setting missing levels in '%s', only present",
                             " in test data but missing in train data,",
                             " to 'NA'."),
                      var))
    }
  }
  return(test_data)
}

我们可以将此函数应用到问题中的示例中,如下所示:

predict(model,newdata=remove_missing_levels (fit=model, test_data=foo.new))

在尝试改进此函数时,我发现 SL 学习方法如 lmglm 等。火车和火车需要相同的水平如果删除级别,则测试 ML 学习方法(svmrandomForest)会失败。这些方法需要所有级别的训练和训练。测试。

通用的解决方案很难实现,因为每个拟合模型都有不同的方式来存储其因子级别组件(fit$xlevels 用于 lmfit$contrasts< /code> 用于 glmmPQL)。至少它在 lm 相关模型中似乎是一致的。

Tidied and extended the function by MorgenBall. It is also implemented in sperrorest now.

Additional features

  • drops unused factor levels rather than just setting the missing values to NA.
  • issues a message to the user that factor levels have been dropped
  • checks for existence of factor variables in test_data and returns original data.frame if non are present
  • works not only for lm, glm and but also for glmmPQL

Note: The function shown here may change (improve) over time.

#' @title remove_missing_levels
#' @description Accounts for missing factor levels present only in test data
#' but not in train data by setting values to NA
#'
#' @import magrittr
#' @importFrom gdata unmatrix
#' @importFrom stringr str_split
#'
#' @param fit fitted model on training data
#'
#' @param test_data data to make predictions for
#'
#' @return data.frame with matching factor levels to fitted model
#'
#' @keywords internal
#'
#' @export
remove_missing_levels <- function(fit, test_data) {

  # https://stackoverflow.com/a/39495480/4185785

  # drop empty factor levels in test data
  test_data %>%
    droplevels() %>%
    as.data.frame() -> test_data

  # 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to
  # account for it
  if (any(class(fit) == "glmmPQL")) {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$contrasts))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }

    map(fit$contrasts, function(x) names(unmatrix(x))) %>%
      unlist() -> factor_levels
    factor_levels %>% str_split(":", simplify = TRUE) %>%
      extract(, 1) -> factor_levels

    model_factors <- as.data.frame(cbind(factors, factor_levels))
  } else {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$xlevels))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }

    factor_levels <- unname(unlist(fit$xlevels))
    model_factors <- as.data.frame(cbind(factors, factor_levels))
  }

  # Select column names in test data that are factor predictors in
  # trained model

  predictors <- names(test_data[names(test_data) %in% factors])

  # For each factor predictor in your data, if the level is not in the model,
  # set the value to NA

  for (i in 1:length(predictors)) {
    found <- test_data[, predictors[i]] %in% model_factors[
      model_factors$factors == predictors[i], ]$factor_levels
    if (any(!found)) {
      # track which variable
      var <- predictors[i]
      # set to NA
      test_data[!found, predictors[i]] <- NA
      # drop empty factor levels in test data
      test_data %>%
        droplevels() -> test_data
      # issue warning to console
      message(sprintf(paste0("Setting missing levels in '%s', only present",
                             " in test data but missing in train data,",
                             " to 'NA'."),
                      var))
    }
  }
  return(test_data)
}

We can apply this function to the example in the question as follows:

predict(model,newdata=remove_missing_levels (fit=model, test_data=foo.new))

While trying to improve this function, I came across the fact that SL learning methods like lm, glm etc. need the same levels in train & test while ML learning methods (svm, randomForest) fail if the levels are removed. These methods need all levels in train & test.

A general solution is quite hard to achieve since every fitted model has a different way of storing their factor level component (fit$xlevels for lm and fit$contrasts for glmmPQL). At least it seems to be consistent across lm related models.

童话里做英雄 2024-10-11 17:39:18

听起来您可能喜欢随机效果。看看像 glmer (lme4 包)这样的东西。使用贝叶斯模型,当估计效果时可使用的信息很少时,您将获得接近 0 的效果。但警告您必须自己进行预测,而不是使用 Predict()。

或者,您可以简单地为要包含在模型中的级别创建虚拟变量,例如变量 0/1 表示星期一,一个变量表示星期二,一个变量表示星期三等。如果星期日包含所有内容,则星期日将自动从模型中删除0 的。但其他数据中星期日列的值为 1 不会导致预测步骤失败。它只是假设周日的影响与其他天的平均影响相同(这可能是真的,也可能不是)。

Sounds like you might like random effects. Look into something like glmer (lme4 package). With a Bayesian model, you'll get effects that approach 0 when there's little information to use when estimating them. Warning, though, that you'll have to do prediction yourself, rather than using predict().

Alternatively, you can simply make dummy variables for the levels you want to include in the model, e.g. a variable 0/1 for Monday, one for Tuesday, one for Wednesday, etc. Sunday will be automatically removed from the model if it contains all 0's. But having a 1 in the Sunday column in the other data won't fail the prediction step. It will just assume that Sunday has an effect that's average the other days (which may or may not be true).

半仙 2024-10-11 17:39:18

线性/逻辑回归的假设之一是很少或没有多重共线性;因此,如果预测变量在理想情况下彼此独立,则模型不需要查看所有可能的因素水平。新的因子水平(D)是新的预测因子,可以设置为NA而不影响其余因子A、B、C的预测能力。这就是为什么模型仍然能够做出预测。但添加新的 D 级会破坏预期的模式。这就是整个问题。设置 NA 可以解决这个问题。

One of the assumptions of Linear/Logistic Regressions is to little or no multi-collinearity; so if the predictor variables are ideally independent of each other, then the model does not need to see all the possible variety of factor levels. A new factor level (D) is a new predictor, and can be set to NA without affecting the predicting ability of the remaining factors A,B,C. This is why the model should still be able to make predictions. But addition of the new level D throws off the expected schema. That's the whole issue. Setting NA fixes that.

娇妻 2024-10-11 17:39:18

如果您在调用 predict 时设置标志 allow.new.levels=TRUE,则 lme4 包将处理新级别。

示例:如果您的星期几因子位于变量 dow 和分类结果 b_fail 中,您可以运行


M0 <- lmer(b_fail ~ x + (1 | dow), data=df.your.data, family=二项式(link='logit'))
M0.preds <- 预测(M0, df.new.data,allow.new.levels=TRUE)

这是一个随机效应逻辑回归的示例。当然,您可以执行常规回归……或大多数 GLM 模型。如果您想进一步沿着贝叶斯路径前进,请查看 Gelman & Hill 的优秀著作和 Stan 基础设施。

The lme4 package will handle new levels if you set the flag allow.new.levels=TRUE when calling predict.

Example: if your day of week factor is in a variable dow and a categorical outcome b_fail, you could run


M0 <- lmer(b_fail ~ x + (1 | dow), data=df.your.data, family=binomial(link='logit'))
M0.preds <- predict(M0, df.new.data, allow.new.levels=TRUE)

This is an example with a random effects logistic regression. Of course, you can perform regular regression ... or most GLM models. If you want to head further down the Bayesian path, look at Gelman & Hill's excellent book and the Stan infrastructure.

梦与时光遇 2024-10-11 17:39:18

拆分测试的一个快速而肮脏的解决方案是将稀有值重新编码为“其他”。下面是一个实现:

rare_to_other <- function(x, fault_factor = 1e6) {
  # dirty dealing with rare levels:
  # recode small cells as "other" before splitting to train/test,
  # assuring that lopsided split occurs with prob < 1/fault_factor
  # (N.b. not fully kosher, but useful for quick and dirty exploratory).

  if (is.factor(x) | is.character(x)) {
    min.cell.size = log(fault_factor, 2) + 1
    xfreq <- sort(table(x), dec = T)
    rare_levels <- names(which(xfreq < min.cell.size))
    if (length(rare_levels) == length(unique(x))) {
      warning("all levels are rare and recorded as other. make sure this is desirable")
    }
    if (length(rare_levels) > 0) {
      message("recoding rare levels")
      if (is.factor(x)) {
        altx <- as.character(x)
        altx[altx %in% rare_levels] <- "other"
        x <- as.factor(altx)
        return(x)
      } else {
        # is.character(x)
        x[x %in% rare_levels] <- "other"
        return(x)
      }
    } else {
      message("no rare levels encountered")
      return(x)
    }
  } else {
    message("x is neither a factor nor a character, doing nothing")
    return(x)
  }
}

例如,对于 data.table,调用将类似于:

dt[, (xcols) := mclapply(.SD, rare_to_other), .SDcol = xcols] # recode rare levels as other

其中 xcolscolnames(dt) 的任意子集。

A quick-and-dirty solution for split testing, is to recode rare values as "other". Here is an implementation:

rare_to_other <- function(x, fault_factor = 1e6) {
  # dirty dealing with rare levels:
  # recode small cells as "other" before splitting to train/test,
  # assuring that lopsided split occurs with prob < 1/fault_factor
  # (N.b. not fully kosher, but useful for quick and dirty exploratory).

  if (is.factor(x) | is.character(x)) {
    min.cell.size = log(fault_factor, 2) + 1
    xfreq <- sort(table(x), dec = T)
    rare_levels <- names(which(xfreq < min.cell.size))
    if (length(rare_levels) == length(unique(x))) {
      warning("all levels are rare and recorded as other. make sure this is desirable")
    }
    if (length(rare_levels) > 0) {
      message("recoding rare levels")
      if (is.factor(x)) {
        altx <- as.character(x)
        altx[altx %in% rare_levels] <- "other"
        x <- as.factor(altx)
        return(x)
      } else {
        # is.character(x)
        x[x %in% rare_levels] <- "other"
        return(x)
      }
    } else {
      message("no rare levels encountered")
      return(x)
    }
  } else {
    message("x is neither a factor nor a character, doing nothing")
    return(x)
  }
}

For example, with data.table, the call would be something like:

dt[, (xcols) := mclapply(.SD, rare_to_other), .SDcol = xcols] # recode rare levels as other

where xcols is a any subset of colnames(dt).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文