Preparing test/train sets for cross-validation in a loop
I am trying to build test and train groups for cross-validation. I have a pool of 95 individual IDs and tried to set this up as follows:
# create 95 unique IDs as individuals
set.seed(1)
indv <- stringi::stri_rand_strings(95, 4)
# specify the number of folds
n.folds <- 5
folds <- cut(1:length(indv), breaks = n.folds, labels = FALSE)
# randomise the folds
folds <- sample(folds, length(folds))
samples.train <- list()
samples.test <- list()
foldSet <- list()
kfold.df <- data.frame("IID" = indv)
for (f in 1:n.folds) {
  samples.train[[f]] <- indv[folds != f]
  samples.test[[f]] <- indv[folds == f]
  # label an ID "test" if it is in this fold's test set, otherwise "train"
  foldSet[[f]] <- ifelse(kfold.df$IID %in% samples.test[[f]], "test", "train")
  # combine foldSet with the data frame
  kfold.df[[f]] <- cbind(kfold.df, foldSet[[f]])
}
The goal is to prepare 5 testing and training sets of samples for modeling. But I encountered this error message:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 95, 2
Besides, the foldSet output is not as expected, although samples.train and samples.test are correct. Could you please help me get this loop working?
UPDATE:
Here is the for loop without using wildcards when creating foldSet:
for (f in 1:n.folds) {
  samples.train[[f]] <- indv[folds != f]
  samples.test[[f]] <- indv[folds == f]
  foldSet <<- ifelse(kfold.df$IID %in% samples.test[[f]], "test", "train")
  # combine foldSet with the data frame
  kfold.df <<- cbind(kfold.df, foldSet)
}
After executing the loop, kfold.df is a data frame listing the random test/train assignments for all five folds. I expected each iteration to create the testing and training sets corresponding to f, so that after five iterations I would have access to each fold's training/testing sets for the next operations inside the loop, like kfold.df[foldSet == "train", "IID"]. I need this access because I want to use it to subset another, bigger matrix based on the train and test indv of each fold, preparing it for a regression model. That's why I used the wildcards for foldSet, so that the loop could create everything by itself, but I could not get it to work.
Comments (1)
I think you may be overcomplicating things (which is something I do all the time...)
You don't need to go to great lengths to make what you are trying to make. This answer is broken down into three parts.
Part 1
If I understand correctly, this is about what you're looking for (less the strings). I also included how you might use it with your actual data.
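A minimal sketch of that idea, not necessarily the answer's original code. It assumes base R only, uses made-up IDs in place of the random strings, and introduces a hypothetical matrix big.mat (one row per individual) to stand in for the larger matrix. Each fold gets its own test/train column, so nothing has to be stacked with cbind() inside the loop.

```r
set.seed(1)

# Stand-in IDs (hypothetical; the question used stringi::stri_rand_strings)
indv <- sprintf("ID%02d", 1:95)
n.folds <- 5

# Assign every individual to one of the 5 folds at random (19 per fold)
folds <- sample(rep(seq_len(n.folds), length.out = length(indv)))

# One "test"/"train" column per fold, named fold1 ... fold5
kfold.df <- data.frame(IID = indv)
for (f in seq_len(n.folds)) {
  kfold.df[[paste0("fold", f)]] <- ifelse(folds == f, "test", "train")
}

# Hypothetical larger matrix with one row per individual (assumption)
big.mat <- matrix(rnorm(length(indv) * 3), nrow = length(indv),
                  dimnames = list(indv, c("x1", "x2", "x3")))

# Subset the matrix by row name for fold 2
train.ids <- kfold.df$IID[kfold.df$fold2 == "train"]
test.ids  <- kfold.df$IID[kfold.df$fold2 == "test"]
train.mat <- big.mat[train.ids, , drop = FALSE]
test.mat  <- big.mat[test.ids, , drop = FALSE]
```

Naming the columns fold1 to fold5 keeps each fold's assignment addressable later, for example kfold.df$fold3.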
Part 2
You could just create your random fold sequence and use that with a model; you don't need to stack the folds in a data frame. You have to loop through the model the same number of times anyway, so why not do both at the same time?
To check out the model performance:
Predict and inspect the testing performance
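One way this could look, as a sketch under stated assumptions: the data frame dat, its columns x1, x2 and y, and the choice of lm() with RMSE as the metric are all hypothetical stand-ins, since the question does not show the real data or model. The fold vector alone drives the split, and fitting and prediction happen in the same pass.

```r
set.seed(1)
n <- 95
n.folds <- 5

# Hypothetical data: a response y and two predictors (not from the question)
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)

# Random fold label for every row
folds <- sample(rep(seq_len(n.folds), length.out = n))

rmse <- numeric(n.folds)
for (f in seq_len(n.folds)) {
  train <- dat[folds != f, ]
  test  <- dat[folds == f, ]

  fit  <- lm(y ~ x1 + x2, data = train)  # fit on the training rows only
  pred <- predict(fit, newdata = test)   # predict the held-out rows

  rmse[f] <- sqrt(mean((test$y - pred)^2))
}

rmse        # test RMSE for each fold
mean(rmse)  # cross-validated estimate
```

The same folds != f and folds == f masks can subset a larger matrix by row, provided its rows are in the same order as the IDs.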
Part 3
Here I'm going to use the caret library. However, there are a lot of other options. If you want to see each of the folds' performance, you can do that, too.
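As a rough sketch of both points, assuming the caret package is installed and using hypothetical simulated data (dat, x1, x2 and y are not from the question): trainControl() requests 5-fold cross-validation, train() fits a linear model, and the resample element of the result holds the per-fold metrics.

```r
library(caret)

set.seed(1)
n <- 95

# Hypothetical data standing in for the real predictors and response
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)

# 5-fold cross-validation; caret creates the folds itself
ctrl <- trainControl(method = "cv", number = 5)

fit <- train(y ~ x1 + x2, data = dat, method = "lm", trControl = ctrl)

fit$results   # performance averaged over the folds
fit$resample  # one row per fold: RMSE, Rsquared, MAE
```

If the folds need to match a split made elsewhere, trainControl() also accepts an index argument listing the training rows for each resample.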