How to correctly perform RepeatedKFold CV?
I am working on a binary classification problem using a random forest, with a dataset of 977 records and 6 columns. The class ratio is 77:23 (imbalanced dataset).
Since my dataset is small, I learnt that it is not advisable to split it using a regular 70/30 train_test split.
So I was thinking of doing repeated K-fold CV instead. Please find my code below.
Approach 1 - Full data - X, y

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rf_boruta = RandomForestClassifier(class_weight='balanced', max_depth=3, max_features='sqrt', n_estimators=300)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100)
scores = cross_val_score(rf_boruta, X, y, scoring='f1', cv=cv)
print('mean f1: %.3f' % scores.mean())
But I see that the full input data X is passed to the model at once. Doesn't this lead to data leakage? Meaning, if I am doing categorical encoding, it would be based on all the categories encountered in the full dataset. Similarly, if the dataset ranges from the year 2017 to 2022, it is possible that the model trains on 2021 data in one of the folds and validates on 2020 data.
So, is it right to use RepeatedKFold like the below instead?
Approach 2 - only train data - X_train, y_train

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rf_boruta = RandomForestClassifier(class_weight='balanced', max_depth=3, max_features='sqrt', n_estimators=300)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100)
scores = cross_val_score(rf_boruta, X_train, y_train, scoring='f1', cv=cv)
print('mean f1: %.3f' % scores.mean())
Can you help me understand which approach is best to use?

I'd say that there are two ways to do it. The first way is to write the code for training and validation manually. Here is an example of the code for it:
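(The original code example did not survive extraction; below is a minimal sketch of the manual approach. The data is a hypothetical stand-in generated with `make_classification` to match the question's shape and class ratio, a `StandardScaler` stands in for whatever preprocessing you actually do, and `n_repeats` is lowered from 100 to 3 just to keep the sketch fast. The key point is that preprocessing is fit on each fold's training split only, so nothing leaks from the validation split.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the question's data: 977 rows, 6 columns, ~77:23 classes
X, y = make_classification(n_samples=977, n_features=6, weights=[0.77], random_state=42)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)  # fewer repeats for speed
scores = []
for train_idx, val_idx in cv.split(X, y):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]

    # Fit preprocessing on the training fold ONLY, then apply it to the validation fold
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_val = scaler.transform(X_tr), scaler.transform(X_val)

    model = RandomForestClassifier(class_weight='balanced', max_depth=3,
                                   max_features='sqrt', n_estimators=300)
    model.fit(X_tr, y_tr)
    scores.append(f1_score(y_val, model.predict(X_val)))

print('mean f1: %.3f' % np.mean(scores))
```

This is more verbose than `cross_val_score`, but it makes the leakage boundary explicit: every statistic learned during preprocessing comes from the training indices of the current fold.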
The second way is to use Pipeline from sklearn:
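(Again, the original snippet was lost; here is a minimal sketch. The `StandardScaler` step and the generated data are placeholders for your own preprocessing and dataset, and `n_repeats` is reduced for speed. Because the whole `Pipeline` is passed to `cross_val_score`, every step is refit inside each fold, which gives the same leak-free behaviour as the manual loop with much less code.)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the question's data
X, y = make_classification(n_samples=977, n_features=6, weights=[0.77], random_state=42)

# cross_val_score clones and refits the whole pipeline per fold, so the
# scaler's statistics are always computed from that fold's training split
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('rf', RandomForestClassifier(class_weight='balanced', max_depth=3,
                                  max_features='sqrt', n_estimators=300)),
])

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)  # fewer repeats for speed
scores = cross_val_score(pipe, X, y, scoring='f1', cv=cv)
print('mean f1: %.3f' % scores.mean())
```

With this setup, passing the full X and y to `cross_val_score` is fine: the split into train and validation happens inside the CV loop, before any fitting.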