Applying the random forest algorithm to a dataset that contains missing values

Posted on 2025-01-20 09:28:36

I would like to apply the random forest algorithm from the mlr package to a data set, namely the Zoo dataset from the mlbench package.

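library(tibble)  # as_tibble()
library(dplyr)   # mutate_if()

data(Zoo, package = "mlbench")
zooTib <- as_tibble(Zoo)
# Convert the logical columns to factors.
zooTib <- mutate_if(zooTib, is.logical, as.factor)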

Before that, I introduced random NAs into every column except the target variable type, which I left complete.

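zooTibOrig <- zooTib
# Set a random 10% of each predictor to NA (all columns except the last one, type).
zooTib <- apply(zooTib[, 1:(ncol(zooTib) - 1)], 2, function(x) {x[sample(seq_len(nrow(zooTib)), floor(nrow(zooTib) / 10))] <- NA; x})
# Re-attach the complete target column.
zooTib <- cbind(zooTib, zooTibOrig[, ncol(zooTibOrig)])
zooTib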

Before testing the random forest, I ran the Zoo dataset with the NAs through a simple rpart decision tree. rpart can process datasets with NAs thanks to its surrogate splits, which are controlled by the parameters maxsurrogate and usesurrogate. The dataset went through and the code ran without any problems.
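
For illustration, the surrogate-split behaviour described above can be sketched with rpart alone. This is a minimal sketch, not the code from the post: it uses the built-in iris data as a stand-in so that only rpart is needed, and the values chosen for maxsurrogate and usesurrogate are arbitrary.

```r
library(rpart)

set.seed(1)
dat <- iris  # stand-in data (an assumption); the post uses the Zoo dataset

# Blank out 10% of the rows in each predictor; the target Species stays complete.
for (col in setdiff(names(dat), "Species")) {
  dat[sample(nrow(dat), floor(nrow(dat) / 10)), col] <- NA
}

# Fit a classification tree; the surrogate settings decide how rows with a
# missing split variable are routed down the tree.
fit <- rpart(
  Species ~ ., data = dat, method = "class",
  control = rpart.control(maxsurrogate = 5, usesurrogate = 2)
)

# Predict on the same incomplete data.
pred <- predict(fit, newdata = dat, type = "class")
print(table(predicted = pred, actual = dat$Species))
```

With usesurrogate = 2, an observation whose primary and surrogate split variables are all missing is sent in the majority direction, so every row still receives a prediction.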

Next I wanted to use the above mentioned random forest algorithm.

library(mlr)

# The task definition was not shown in the post; this form is assumed here.
zooTask <- makeClassifTask(data = zooTib, target = "type")

forest <- makeLearner("classif.randomForest")
forestParamSpace <- makeParamSet(
  makeIntegerParam("ntree", lower = 300, upper = 300),
  makeIntegerParam("mtry", lower = 6, upper = 12),
  makeIntegerParam("nodesize", lower = 1, upper = 5),
  makeIntegerParam("maxnodes", lower = 5, upper = 20))
randSearch <- makeTuneControlRandom(maxit = 100)
cvForTuning <- makeResampleDesc("CV", iters = 5)

tunedForestPars <- tuneParams(forest, task = zooTask,
                          resampling = cvForTuning,
                          par.set = forestParamSpace,
                          control = randSearch)

However, as soon as I wanted to run the parameter tuning process I got the error message:

"Error in checkLearnerBeforeTrain(task, learner, weights) : Task
'zooTib' has missing values in 'hair, feathers, eggs, milk, airborne,
aquatic, ...', but learner 'classif.randomForest' does not support
that!"
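
The message comes from a check mlr runs before training: each learner declares a set of properties, and a task with NAs is only accepted if the learner lists "missings" among them. A small sketch of how this could be inspected, assuming the mlr, randomForest and rpart packages are installed:

```r
library(mlr)

forest <- makeLearner("classif.randomForest")
tree   <- makeLearner("classif.rpart")

# A learner accepts tasks with NAs only if it declares the "missings" property.
forestHandlesNA <- hasLearnerProperties(forest, "missings")
treeHandlesNA   <- hasLearnerProperties(tree, "missings")
print(c(randomForest = forestHandlesNA, rpart = treeHandlesNA))

# Classification learners that declare support for missing values.
naLearners <- listLearners("classif", properties = "missings",
                           warn.missing.packages = FALSE)
print(head(naLearners$class))
```

The same property list is available through getLearnerProperties(forest), which may be a quicker check than reading the parameter set.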

This is strange, since a random forest is "merely" an ensemble of several decision trees, which in turn can handle missing values. I actually tried it beforehand, and the rpart algorithm worked perfectly fine.

I wanted to set the surrogate split parameter, but for a random forest this setting does not exist: when I call getParamSet(forest), no surrogate-split parameter appears.

Is there a way to pass records containing NAs to a random forest?
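
One possible route, sketched here under assumptions rather than as a definitive answer, is to let mlr impute the missing values before the forest sees them. The sketch wraps the learner with makeImputeWrapper so that the imputation is fitted inside each resampling fold. The iris stand-in data and the choice of median and mode imputation are assumptions made for illustration; they are not part of the post.

```r
library(mlr)

set.seed(1)
dat <- iris  # stand-in data (an assumption); the post uses the Zoo dataset

# Blank out 10% of the rows in each predictor; the target stays complete.
for (col in setdiff(names(dat), "Species")) {
  dat[sample(nrow(dat), floor(nrow(dat) / 10)), col] <- NA
}
task <- makeClassifTask(data = dat, target = "Species")

# Wrap the forest so that NAs are imputed per feature class before training
# and prediction; the wrapped learner then accepts tasks with missing values.
imputedForest <- makeImputeWrapper(
  makeLearner("classif.randomForest", ntree = 300),
  classes = list(
    numeric = imputeMedian(),
    integer = imputeMedian(),
    factor  = imputeMode()
  )
)

cv  <- makeResampleDesc("CV", iters = 5)
res <- resample(imputedForest, task, cv, show.info = FALSE)
print(res$aggr)
```

The wrapped learner can be handed to tuneParams in place of the plain forest. The randomForest package also ships its own helpers, na.roughfix and rfImpute, and a learner that declares the "missings" property could be used instead of imputing.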
