Applying the random forest algorithm to a dataset containing missing values

Posted on 2025-01-20 09:28:36


I would like to apply the random forest algorithm from the mlr package to the Zoo dataset from the mlbench package.

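library(mlr)
library(dplyr)  # mutate_if(); as_tibble() is re-exported from tibble
data(Zoo, package = "mlbench")
zooTib <- as_tibble(Zoo)
# Convert the logical feature columns to factors
zooTib <- mutate_if(zooTib, is.logical, as.factor)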

Before that, I introduced random NAs into every column except the target variable "type", which I left complete.

zooTibOrig <- zooTib
# Set a random 10% of each feature column to NA; lapply() keeps the column types
featureCols <- seq_len(ncol(zooTib) - 1)
zooTib[featureCols] <- lapply(zooTib[featureCols], function(x) {
  x[sample(length(x), floor(length(x) / 10))] <- NA
  x
})
zooTib
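To confirm that the injection behaved as intended, one could count the NAs per column and check that the column types survived. The following is only a sketch on a hypothetical toy data frame (the columns "hair", "legs" and "type" and the seed are illustrative stand-ins, not the real Zoo data):

```r
set.seed(42)

# Hypothetical toy data standing in for the Zoo features plus the target column
toy <- data.frame(
  hair = factor(sample(c(TRUE, FALSE), 50, replace = TRUE)),
  legs = sample(c(0L, 2L, 4L), 50, replace = TRUE),
  type = factor(sample(c("bird", "mammal"), 50, replace = TRUE))
)

# Set 10% of each feature column to NA, leaving the target untouched
nMissing <- floor(nrow(toy) / 10)
featureCols <- setdiff(names(toy), "type")
toy[featureCols] <- lapply(toy[featureCols], function(x) {
  x[sample(length(x), nMissing)] <- NA
  x
})

# Number of NAs per column and the class of each column
naCounts <- colSums(is.na(toy))
colTypes <- vapply(toy, function(x) class(x)[1], character(1))
print(naCounts)
print(colTypes)
```

Each feature column should report the same number of NAs, and the target column should report none.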

Before testing the random forest, I ran the Zoo dataset with the NAs through a simple rpart decision tree. rpart can process datasets with NAs thanks to its surrogate split parameters "maxsurrogate" and "usesurrogate", so the dataset was accepted and the code ran without any problems.
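For reference, the surrogate split behaviour can be reproduced with rpart directly, outside mlr. This is a minimal sketch, assuming the rpart package is installed; iris with artificially injected NAs is used as a stand-in for the Zoo data, and the seed and parameter values are illustrative choices rather than anything taken from the post:

```r
library(rpart)

set.seed(1)

# Inject 15 NAs into each of the four numeric feature columns of iris
irisNA <- iris
for (j in 1:4) {
  irisNA[sample(nrow(irisNA), 15), j] <- NA
}

# maxsurrogate: number of surrogate splits kept per node
# usesurrogate = 2: use surrogates, then fall back to the majority direction
tree <- rpart(Species ~ ., data = irisNA, method = "class",
              control = rpart.control(maxsurrogate = 5, usesurrogate = 2))

# Rows with missing features still receive a prediction
pred <- predict(tree, newdata = irisNA, type = "class")
print(table(pred, irisNA$Species))
```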

Next, I wanted to use the random forest algorithm mentioned above.

# Assumed task definition (the task id defaults to the data name, 'zooTib')
zooTask <- makeClassifTask(data = zooTib, target = "type")

forest <- makeLearner("classif.randomForest")
forestParamSpace <- makeParamSet(
  makeIntegerParam("ntree", lower = 300, upper = 300),
  makeIntegerParam("mtry", lower = 6, upper = 12),
  makeIntegerParam("nodesize", lower = 1, upper = 5),
  makeIntegerParam("maxnodes", lower = 5, upper = 20))
randSearch <- makeTuneControlRandom(maxit = 100)
cvForTuning <- makeResampleDesc("CV", iters = 5)

tunedForestPars <- tuneParams(forest, task = zooTask,
                              resampling = cvForTuning,
                              par.set = forestParamSpace,
                              control = randSearch)

However, as soon as I wanted to run the parameter tuning process I got the error message:

"Error in checkLearnerBeforeTrain(task, learner, weights) : Task
'zooTib' has missing values in 'hair, feathers, eggs, milk, airborne,
aquatic, ...', but learner 'classif.randomForest' does not support
that!"
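The message appears to come from a check that mlr runs before any model is fitted: the task is compared against the properties the learner declares, and "missings" is one of those properties. A small sketch, assuming the mlr, randomForest and rpart packages are installed, of how the two learners could be compared:

```r
library(mlr)

forest <- makeLearner("classif.randomForest")  # needs the randomForest package
tree   <- makeLearner("classif.rpart")         # needs the rpart package

# A learner accepts NAs in the features only if it declares "missings"
forestHandlesNA <- "missings" %in% getLearnerProperties(forest)
treeHandlesNA   <- "missings" %in% getLearnerProperties(tree)
print(c(forest = forestHandlesNA, tree = treeHandlesNA))
```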

This seems strange, since a random forest is "merely" an ensemble of several decision trees, which in turn can handle missing values. I actually tried it beforehand, and the rpart algorithm worked perfectly fine.

I wanted to set the surrogate split parameters, but they do not exist for a random forest: when I call getParamSet(forest), no surrogate split parameters appear.
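The difference between the two parameter sets can be checked programmatically. A minimal sketch, assuming mlr (which attaches ParamHelpers and its getParamIds() function) together with the randomForest and rpart packages:

```r
library(mlr)

forest <- makeLearner("classif.randomForest")
tree   <- makeLearner("classif.rpart")

# Names of all hyperparameters each learner exposes
forestParams <- getParamIds(getParamSet(forest))
treeParams   <- getParamIds(getParamSet(tree))

# Keep only the parameters related to surrogate splits
surrogateInForest <- grep("surrogate", forestParams, value = TRUE)
surrogateInTree   <- grep("surrogate", treeParams, value = TRUE)
print(surrogateInForest)
print(surrogateInTree)
```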

Is there any way to pass records containing NAs to a random forest?
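One possible route, sketched here under assumptions rather than offered as the definitive answer, is to impute the missing values before the forest sees them. mlr provides makeImputeWrapper() for this; the choice of mode imputation for factors and median or mean imputation for numbers is an illustrative assumption, as is the seed. The data preparation is repeated in base R so that the example is self-contained:

```r
library(mlr)

set.seed(123)
data(Zoo, package = "mlbench")
zoo <- Zoo

# Convert logical feature columns to factors
isLogical <- vapply(zoo, is.logical, logical(1))
zoo[isLogical] <- lapply(zoo[isLogical], as.factor)

# Inject NAs into 10% of each feature column; the target "type" stays complete
featureCols <- setdiff(names(zoo), "type")
zoo[featureCols] <- lapply(zoo[featureCols], function(x) {
  x[sample(length(x), floor(length(x) / 10))] <- NA
  x
})

zooTask <- makeClassifTask(data = zoo, target = "type")

# Wrap the forest so the imputation is learned on the training data only
imputedForest <- makeImputeWrapper(
  makeLearner("classif.randomForest"),
  classes = list(factor = imputeMode(), integer = imputeMedian(),
                 numeric = imputeMean())
)

model <- train(imputedForest, zooTask)
pred <- predict(model, task = zooTask)
print(performance(pred, measures = acc))  # accuracy on the training data
```

The wrapped learner declares the "missings" property, so it could be passed to tuneParams() in place of the plain forest learner.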
