Correct way to use a calibrated classifier with a pipeline

Posted on 2025-01-13 17:02:00

I train a model as follows:

import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1, random_state=random_state_split_data)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, stratify=y_train, test_size=0.1, random_state=random_state_split_data)

under = RandomUnderSampler(sampling_strategy=0.2)
X_train, y_train = under.fit_resample(X_train, y_train)

# Define the pipeline
selector = RFE(estimator=RandomForestClassifier(), n_features_to_select=100)
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0))])
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_cols)])
model = XGBClassifier(objective='binary:logistic', n_jobs=29, use_label_encoder=False, random_state=42)
pipe = Pipeline(steps=[('preprocessor', preprocessor), ('var', VarianceThreshold()), ('sel', selector), ('clf', model)])

I then run a grid search on this pipeline:

from sklearn.model_selection import GridSearchCV

gridsearch = GridSearchCV(pipe, param_grid, cv=3, verbose=1, n_jobs=-1)
gridsearch.fit(X_train, y_train)

I take the best estimator as my result:

best_est = gridsearch.best_estimator_

I then carry out calibration:

import pandas as pd

# Apply every fitted step except the final classifier
X_validation_calibrate = pd.DataFrame(best_est[:-1].transform(X_validation), columns=features_cols)
X_test_calibrate = pd.DataFrame(best_est[:-1].transform(X_test), columns=features_cols)

I then pass these to the calibrators, for example:

from sklearn.calibration import CalibratedClassifierCV

sig_clf = CalibratedClassifierCV(best_est['clf'], method="sigmoid", cv="prefit")
iso_clf = CalibratedClassifierCV(best_est['clf'], method="isotonic", cv="prefit")

sig_clf.fit(X_validation_calibrate, y_validation)
iso_clf.fit(X_validation_calibrate, y_validation)
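
One possible way to compare two calibrated models on held-out data is to look at the Brier score together with the largest gap on the reliability curve. The following is only a sketch under stated assumptions: the helper name `calibration_report` is hypothetical, and the probabilities are synthetic stand-ins rather than the predictions from this post.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss


def calibration_report(y_true, proba_by_name, n_bins=10):
    """Return the Brier score and the largest reliability-curve gap per model."""
    report = {}
    for name, proba in proba_by_name.items():
        # Fraction of positives vs. mean predicted probability in each bin.
        frac_pos, mean_pred = calibration_curve(y_true, proba, n_bins=n_bins)
        report[name] = {
            "brier": float(brier_score_loss(y_true, proba)),
            "max_gap": float(np.max(np.abs(frac_pos - mean_pred))),
        }
    return report


# Synthetic stand-in for held-out predictions (not the data from the post).
rng = np.random.default_rng(0)
p_true = rng.uniform(0.0, 1.0, size=5000)
y_true = (rng.uniform(0.0, 1.0, size=5000) < p_true).astype(int)

report = calibration_report(
    y_true,
    {"well_calibrated": p_true, "distorted": p_true ** 3},
)
for name, scores in report.items():
    print(name, scores)
```

With the objects from the post, the dictionary would instead hold something like `sig_clf.predict_proba(X_test_calibrate)[:, 1]` and `iso_clf.predict_proba(X_test_calibrate)[:, 1]`, scored against `y_test`. Lower values are better for both numbers.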

My sig_clf had the best calibration, so I would like to use it rather than best_est['clf']. Note that sig_clf wraps only the model, not the preprocessing steps. When I come to make predictions on other datasets, e.g. newdata, does the following make sense?

test1 = best_est[:-1].transform(newdata)
predictions_new = sig_clf.predict_proba(test1)

Above, I use every preprocessing step of the pipeline (best_est[:-1]) to transform an external dataset called newdata, then apply the calibrated sigmoid model to the transformed data to get the final calibrated predictions. Is this correct?
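
The two-step prediction described above can be sanity-checked on toy data. The sketch below rests on several assumptions: the data is synthetic, `LogisticRegression` stands in for `XGBClassifier`, the pipeline is shortened, and `calibrate_prefit` is a hypothetical helper. It uses `FrozenEstimator` where scikit-learn provides it, since newer releases deprecate `cv="prefit"`, and falls back to `cv="prefit"` otherwise. It compares calibrating only the final step on transformed data with calibrating the whole fitted pipeline on raw data.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def calibrate_prefit(estimator, method):
    """Wrap an already fitted estimator so that only the calibrator is learned."""
    try:
        # Newer scikit-learn: freeze the fitted estimator instead of cv="prefit".
        from sklearn.frozen import FrozenEstimator
        return CalibratedClassifierCV(FrozenEstimator(estimator), method=method)
    except ImportError:
        # Older releases: the cv="prefit" spelling used in the question.
        return CalibratedClassifierCV(estimator, method=method, cv="prefit")


X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_fit, X_rest, y_fit, y_rest = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
X_cal, X_new, y_cal, _ = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=0)),
    ("var", VarianceThreshold()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_fit, y_fit)

# Route 1 (as in the question): transform with pipe[:-1], calibrate the last step.
sig_clf = calibrate_prefit(pipe["clf"], "sigmoid")
sig_clf.fit(pipe[:-1].transform(X_cal), y_cal)
proba_two_step = sig_clf.predict_proba(pipe[:-1].transform(X_new))

# Route 2: calibrate the whole fitted pipeline, so raw data goes in directly.
sig_pipe = calibrate_prefit(pipe, "sigmoid")
sig_pipe.fit(X_cal, y_cal)
proba_one_step = sig_pipe.predict_proba(X_new)

print(np.allclose(proba_two_step, proba_one_step))
```

On this toy setup both routes should produce the same probabilities, which suggests the two-step approach is consistent as long as new data goes through exactly the same fitted preprocessing steps. Wrapping the whole pipeline may simply be more convenient, because the calibrated object then accepts raw inputs.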
