How to correctly perform RepeatedKFold CV?
I am working on a binary classification problem using a random forest, with a dataset of 977 records and 6 columns. The class ratio is 77:23 (imbalanced dataset).
Since my dataset is small, I learnt that it is not advisable to split it using a regular 70/30 train_test split.
So I was thinking of doing repeated K-fold CV instead. Please find my code below.
Approach 1 - Full data - X, y

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rf_boruta = RandomForestClassifier(class_weight='balanced', max_depth=3, max_features='sqrt', n_estimators=300)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100)
scores = cross_val_score(rf_boruta, X, y, scoring='f1', cv=cv)
print('mean f1: %.3f' % scores.mean())
But I see that the full input data X is passed to the model at once. Doesn't this lead to data leakage? Meaning, if I am doing categorical encoding, it would be based on all the categories encountered in the full dataset. Similarly, if the dataset ranges from the year 2017 to 2022, it is possible that the model trains on 2021 data in one of the folds and validates on 2020 data.
So, is it right to use RepeatedKFold like the below instead?
Approach 2 - only train data - X_train, y_train

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rf_boruta = RandomForestClassifier(class_weight='balanced', max_depth=3, max_features='sqrt', n_estimators=300)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100)
scores = cross_val_score(rf_boruta, X_train, y_train, scoring='f1', cv=cv)
print('mean f1: %.3f' % scores.mean())
Can you help me understand which approach is best to use?

I'd say that there are two ways to do it. The first way is to write the code for training and validation manually. Here is an example of the code for it:
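(The original code example did not survive extraction; below is a minimal sketch of the manual approach. The data is a hypothetical stand-in generated with `make_classification` to match the question's shape and class ratio, a `StandardScaler` stands in for whatever preprocessing you actually do, and `n_repeats` is lowered from 100 to 3 just to keep the sketch fast. The key point is that preprocessing is fit on each fold's training split only, so nothing leaks from the validation split.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the question's data: 977 rows, 6 columns, ~77:23 classes
X, y = make_classification(n_samples=977, n_features=6, weights=[0.77], random_state=42)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)  # fewer repeats for speed
scores = []
for train_idx, val_idx in cv.split(X, y):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]

    # Fit preprocessing on the training fold ONLY, then apply it to the validation fold
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_val = scaler.transform(X_tr), scaler.transform(X_val)

    model = RandomForestClassifier(class_weight='balanced', max_depth=3,
                                   max_features='sqrt', n_estimators=300)
    model.fit(X_tr, y_tr)
    scores.append(f1_score(y_val, model.predict(X_val)))

print('mean f1: %.3f' % np.mean(scores))
```

This is more verbose than `cross_val_score`, but it makes the leakage boundary explicit: every statistic learned during preprocessing comes from the training indices of the current fold.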
The second way is to use Pipeline from sklearn:
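(Again, the original snippet was lost; here is a minimal sketch. The `StandardScaler` step and the generated data are placeholders for your own preprocessing and dataset, and `n_repeats` is reduced for speed. Because the whole `Pipeline` is passed to `cross_val_score`, every step is refit inside each fold, which gives the same leak-free behaviour as the manual loop with much less code.)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the question's data
X, y = make_classification(n_samples=977, n_features=6, weights=[0.77], random_state=42)

# cross_val_score clones and refits the whole pipeline per fold, so the
# scaler's statistics are always computed from that fold's training split
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('rf', RandomForestClassifier(class_weight='balanced', max_depth=3,
                                  max_features='sqrt', n_estimators=300)),
])

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)  # fewer repeats for speed
scores = cross_val_score(pipe, X, y, scoring='f1', cv=cv)
print('mean f1: %.3f' % scores.mean())
```

With this setup, passing the full X and y to `cross_val_score` is fine: the split into train and validation happens inside the CV loop, before any fitting.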