Preparing test/train sets for cross-validation in a loop

Posted 2025-01-09 03:56:14

I am trying to build test and train groups for cross-validation. I have a pool of 95 individual IDs and tried to do it like this:

# create 95 unique IDs as individuals
set.seed(1)
indv <- stringi::stri_rand_strings(95, 4)

# specify Kfold
n.folds <- 5

folds <- cut(1:length(indv), breaks = n.folds, labels = FALSE)
# randomise the folds
folds <- sample(folds, length(folds)) 

samples.train <- list()
samples.test <- list()
foldSet <- list()

kfold.df <- data.frame("IID" = indv)

for (f in 1:n.folds) {
          samples.train[[f]] <- indv[folds != f]
          samples.test[[f]] <- indv[folds == f]

# label an ID "test" if it is in this fold's test set, and "train" otherwise.
foldSet[[f]] <- ifelse(kfold.df$IID %in% 
                  samples.test[[f]], "test", "train")

# combine foldSet with the data frame.
kfold.df[[f]] <- cbind(kfold.df, foldSet[[f]])
}

The goal is to prepare 5 testing and training sets of samples for modeling. But I ran into this error message:

Error in data.frame(..., check.names = FALSE) : 
arguments imply differing number of rows: 95, 2

Besides, the foldSet output is not as expected, although samples.train and samples.test are correct. Could you please help me get this loop working?

UPDATE:
Here is the for loop without using wildcards when creating foldSet:

for (f in 1:n.folds) {
samples.train[[f]] <- indv[folds != f]
samples.test[[f]] <- indv[folds == f]

foldSet <<- ifelse(kfold.df$IID %in% samples.test[[f]], "test", "train")
# combine foldSet with the data frame.
kfold.df <<- cbind(kfold.df, foldSet)
}

After running the loop, kfold.df is a data frame listing the test/train assignment for all five folds. What I expect is that each iteration creates the testing and training sets for fold f, so that after five iterations I have access to each fold's training/testing sets for later operations inside the loop, like kfold.df[foldSet == "train", "IID"]. I need this access because I want to subset another, bigger matrix based on the train and test indv of each fold, to prepare it for a regression model. That's why I used the wildcards for foldSet, so the loop could create everything by itself, but I couldn't get it to work.


享受孤独 2025-01-16 03:56:14

I think you may be overcomplicating things (which is something I do all the time...)

You don't need to go to great lengths to make what you are trying to make. This answer is broken down into three parts.

1. Building the data frame you're looking for (I think!)
2. Why you really don't need this data frame to be built
3. Why not use what's already out there?

Part 1

If I understand correctly, this is about what you're looking for (less the strings). I also included how you might use it with your actual data.

library(tidyverse)

giveMe <- function(rowCt, nfolds){
  # set.seed(235) # seed removed once the function worked, so that each call
  #  gives a fresh random split

  folds <- cut(1:rowCt, breaks = nfolds, labels = FALSE)
  # randomise the folds
  folds <- sample(folds, length(folds)) 
  # one logical column per fold: TRUE = training row, FALSE = test row
  kfold.df <- map_dfc(1:nfolds,
                      ~folds != .x) %>% 
    setNames(., paste0("foldSet_", 1:nfolds)) %>%  # name each field
    add_column(IID = 1:rowCt, .before = 1) # add indices to the left

  return(kfold.df) # return a data frame
}

given <- giveMe(95, 5)

giveMore <- giveMe(nrow(iris), 5) # uses the built-in iris data set
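
For anyone who prefers to stay with the question's base-R loop, here is a minimal sketch of the same idea. It assumes the row-count error comes from `kfold.df[[f]] <- cbind(...)`, which appears to store a whole data frame in column `f` and overwrite `IID` on the first pass; adding one named column per fold avoids that. The placeholder IDs and the matrix `bigMat` are hypothetical stand-ins for the real IDs and the larger matrix mentioned in the question.

```r
set.seed(1)
n.folds <- 5
indv <- sprintf("ID%03d", 1:95)  # placeholder IDs (assumption, not the question's random strings)
folds <- sample(cut(seq_along(indv), breaks = n.folds, labels = FALSE))

kfold.df <- data.frame(IID = indv, stringsAsFactors = FALSE)

# hypothetical larger matrix with one row per individual, named by ID
bigMat <- matrix(rnorm(95 * 3), nrow = 95,
                 dimnames = list(indv, c("x1", "x2", "x3")))

train.mats <- list()
test.mats <- list()
for (f in seq_len(n.folds)) {
  col <- paste0("foldSet_", f)
  # add ONE new column per fold instead of assigning a cbind() result
  kfold.df[[col]] <- ifelse(folds == f, "test", "train")

  # IDs of this fold, available inside the loop
  train.ids <- kfold.df[kfold.df[[col]] == "train", "IID"]
  test.ids <- kfold.df[kfold.df[[col]] == "test", "IID"]

  # subset the larger matrix by row name
  train.mats[[f]] <- bigMat[train.ids, , drop = FALSE]
  test.mats[[f]] <- bigMat[test.ids, , drop = FALSE]
}
```

After the loop, `kfold.df` has the `IID` column plus `foldSet_1` to `foldSet_5`, and `train.mats[[f]]` / `test.mats[[f]]` hold the matching rows of the larger matrix for fold `f`.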

Part 2

You could just create your random fold sequence and use it directly with a model; you don't need to stack the folds in a data frame. You have to loop through the model the same number of times anyway, so why not do both at the same time?

folds <- sample(cut(1:nrow(iris), 5, # no seed-- random on purpose
                    labels = FALSE))

tellMe <- map(1:5, # one model per fold
              ~lm(Sepal.Length~., 
                  iris[folds != .x, # training rows for this fold
                       1:4])) # columns 1:4 only, so 'Species' is left out
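
The same fold vector also covers the question's per-fold ID lists without any data frame. The sketch below is one possible base-R way to do it, not the only one; the placeholder IDs are an assumption standing in for the question's `indv`.

```r
set.seed(1)
indv <- sprintf("ID%03d", 1:95)  # placeholder IDs (assumption)
folds <- sample(cut(seq_along(indv), breaks = 5, labels = FALSE))

# test IDs per fold: a list with one character vector per fold
samples.test <- split(indv, folds)

# training IDs per fold: everything that is not in that fold's test set
samples.train <- lapply(samples.test, function(ids) setdiff(indv, ids))

lengths(samples.test)
lengths(samples.train)
```

Each element of `samples.train` and `samples.test` can then be used to index the rows of another object by ID.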

To check out the model performance:

map_dfr(1:5, .f = function(x){
  y = tellMe[[x]]
  sigma = sigma(y)
  rsq = summary(y)$adj.r.squared
  c(sigma = sigma, rsq = rsq)
})
# # A tibble: 5 × 2
#   sigma   rsq
#   <dbl> <dbl>
# 1 0.334 0.844
# 2 0.309 0.869
# 3 0.302 0.846
# 4 0.330 0.847
# 5 0.295 0.872

Predict and inspect the testing performance:

# create a list of the predicted values from the test data
showMe <- map(1:5,
              ~predict(tellMe[[.x]], 
                       iris[folds == .x, # test rows for this fold
                            1:4]))

# Grab comparable metrics like those from the models
map_dfr(1:5,
        .f = function(x){
          A = iris[folds == x, ]$Sepal.Length # actual values
          P = showMe[[x]]                     # predicted values
          sigma = sqrt(sum((A - P)^2) / length(A))
          rsq = cor(A, P)^2
          c(sigma = sigma, rsq = rsq)
        })
# # A tibble: 5 × 2
#   sigma   rsq
#   <dbl> <dbl>
# 1 0.232 0.919
# 2 0.342 0.774
# 3 0.366 0.884
# 4 0.250 0.906
# 5 0.384 0.790
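
For comparison, the fit, predict, and score steps above can be folded into one plain for loop, closer to the style of the question. This is a minimal base-R sketch; the seed and the names `cv.metrics`, `rmse`, and `rsq` are illustrative assumptions, so the numbers it produces will not match the output shown above.

```r
set.seed(42)  # seed only so that the sketch is reproducible (assumption)
n.folds <- 5
folds <- sample(cut(seq_len(nrow(iris)), breaks = n.folds, labels = FALSE))

# one row of test metrics per fold
cv.metrics <- data.frame(fold = seq_len(n.folds), rmse = NA_real_, rsq = NA_real_)

for (f in seq_len(n.folds)) {
  train <- iris[folds != f, 1:4]  # training rows, numeric columns only
  test <- iris[folds == f, 1:4]   # held-out rows for this fold

  fit <- lm(Sepal.Length ~ ., data = train)
  pred <- predict(fit, newdata = test)

  # root mean squared error and squared correlation on the held-out rows
  cv.metrics$rmse[f] <- sqrt(mean((test$Sepal.Length - pred)^2))
  cv.metrics$rsq[f] <- cor(test$Sepal.Length, pred)^2
}

cv.metrics
```

Here `rmse` is computed the same way as `sigma` in the test-set block above, so the two columns are directly comparable.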

Part 3

Here I'm going to use the caret library. However, there are a lot of other options.

library(caret)

set.seed(1)
# split training and testing 70/30%
tr <- createDataPartition(iris$Species, p = .7, list = FALSE)

# set up 5-fold cross-validation
trC <- trainControl(method = "cv", number = 5)

# train the model
fit <- train(Sepal.Length~., iris[tr, ], 
             method = "lm", 
             trControl = trC)
summary(fit)
# truncated results best model:
# Residual standard error: 0.2754 on 39 degrees of freedom
# Multiple R-squared:  0.9062,  Adjusted R-squared:  0.8941 

fit.p <- predict(fit, iris[-tr,])
postResample(fit.p, iris[-tr, ]$Sepal.Length)
#      RMSE  Rsquared       MAE 
# 0.2795920 0.8925574 0.2302402

If you want to see each of the folds' performance, you can do that, too.

fit$resample
#        RMSE  Rsquared       MAE Resample
# 1 0.3629901 0.7911634 0.2822708    Fold1
# 2 0.3680954 0.8888947 0.2960464    Fold2
# 3 0.3508317 0.8394489 0.2709989    Fold3
# 4 0.2548549 0.8954633 0.1960375    Fold4
# 5 0.3396910 0.8661239 0.3187768    Fold5
