Applying the Random Forest algorithm to a dataset with missing values
I would like to apply the Random Forest algorithm from the package mlr to a data set. This is the Zoo dataset from the package mlbench.
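library(mlr)     # makeLearner(), tuneParams(), ...
library(tibble)  # as_tibble()
library(dplyr)   # mutate_if()
data(Zoo, package = "mlbench")
zooTib <- as_tibble(Zoo)
zooTib <- mutate_if(zooTib, is.logical, as.factor)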
Before that, however, I introduced random NAs into every predictor column; only the target variable type was left complete.
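zooTibOrig <- zooTib
# Set 10% of the values in every predictor column to NA; the last column (type) stays complete.
# lapply() keeps the column types, whereas apply() would coerce everything to a character matrix.
predictors <- names(zooTib)[-ncol(zooTib)]
nMissing <- floor(nrow(zooTib) / 10)
zooTib[predictors] <- lapply(zooTib[predictors], function(x) {x[sample(nrow(zooTib), nMissing)] <- NA; x})
zooTib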
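To confirm that the NAs ended up where intended, the number of missing values per column can be counted. The following is a minimal base-R sketch on a small made-up data frame (`toy` and its columns are hypothetical stand-ins for zooTib, not the real Zoo data):

```r
set.seed(1)

# Hypothetical stand-in for zooTib: two predictors plus a complete target column
toy <- data.frame(
  hair = factor(sample(c(TRUE, FALSE), 50, replace = TRUE)),
  legs = sample(c(0L, 2L, 4L), 50, replace = TRUE),
  type = factor(sample(c("mammal", "bird"), 50, replace = TRUE))
)

# Inject NAs into 10% of the rows of each predictor, leaving the target untouched
predictors <- setdiff(names(toy), "type")
nMissing <- floor(nrow(toy) / 10)
toy[predictors] <- lapply(toy[predictors], function(x) {
  x[sample(length(x), nMissing)] <- NA
  x
})

# Count missing values per column and check that the column types survived
naPerColumn <- colSums(is.na(toy))
columnClasses <- vapply(toy, function(x) class(x)[1], character(1))
print(naPerColumn)
print(columnClasses)
```

The same two checks, `colSums(is.na(...))` and the column classes, should work unchanged on zooTib.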
Before testing the random forest algorithm, I ran the Zoo dataset with the NAs through a simple rpart decision tree. rpart can process datasets with NAs thanks to its surrogate splits, which are controlled by the parameters "maxsurrogate" and "usesurrogate". So I could pass the dataset to it, and the code ran without any problems.
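For illustration, this is roughly what surrogate-split handling looks like when rpart is called directly rather than through mlr. It is only a sketch under assumptions: the built-in iris data stands in for the Zoo data, and the control values shown are simply rpart's documented defaults written out explicitly.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(42)

# Hypothetical stand-in data: iris with 10% NAs per predictor, target left complete
irisNA <- iris
predictors <- setdiff(names(irisNA), "Species")
nMissing <- floor(nrow(irisNA) / 10)
irisNA[predictors] <- lapply(irisNA[predictors], function(x) {
  x[sample(length(x), nMissing)] <- NA
  x
})

# maxsurrogate: number of surrogate splits retained per node
# usesurrogate = 2: use surrogates for missing values; if all surrogates are
# missing too, send the observation in the majority direction
ctrl <- rpart.control(maxsurrogate = 5, usesurrogate = 2)
tree <- rpart(Species ~ ., data = irisNA, method = "class", control = ctrl)

# Prediction also works for rows that contain missing predictors
pred <- predict(tree, newdata = irisNA, type = "class")
table(predicted = pred, actual = irisNA$Species)
```

Because a surrogate split falls back on a correlated variable when the primary split variable is missing, no rows have to be dropped or imputed beforehand.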
Next I wanted to use the above mentioned random forest algorithm.
# The task definition was not shown in the post; assumed here to be built from zooTib with target "type"
zooTask <- makeClassifTask(data = zooTib, target = "type")
forest <- makeLearner("classif.randomForest")
forestParamSpace <- makeParamSet(
  makeIntegerParam("ntree", lower = 300, upper = 300),
  makeIntegerParam("mtry", lower = 6, upper = 12),
  makeIntegerParam("nodesize", lower = 1, upper = 5),
  makeIntegerParam("maxnodes", lower = 5, upper = 20))
randSearch <- makeTuneControlRandom(maxit = 100)
cvForTuning <- makeResampleDesc("CV", iters = 5)
tunedForestPars <- tuneParams(forest, task = zooTask,
                              resampling = cvForTuning,
                              par.set = forestParamSpace,
                              control = randSearch)
However, as soon as I started the parameter tuning, I got this error message:
"Error in checkLearnerBeforeTrain(task, learner, weights) : Task
'zooTib' has missing values in 'hair, feathers, eggs, milk, airborne,
aquatic, ...', but learner 'classif.randomForest' does not support
that!"
This is strange, since a random forest is "merely" an ensemble of several decision trees, which in turn can handle missing values. I actually tried that beforehand, and the rpart algorithm worked perfectly fine.
I wanted to set the surrogate split parameter, but for a random forest this setting does not exist. When I execute getParamSet(forest), unfortunately no surrogate-split parameter appears there.
Is there a way to pass records containing NAs to a random forest?