The correct way to use a calibrated classifier with a pipeline

Posted on 2025-01-13 17:02:00

I train a model as follows:

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1, random_state=random_state_split_data)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, stratify=y_train, test_size=0.1, random_state=random_state_split_data)

under = RandomUnderSampler(sampling_strategy=0.2)
X_train,y_train = under.fit_resample(X_train,y_train)

#define pipeline 
selector = RFE(estimator=RandomForestClassifier(), n_features_to_select=100)
numeric_transformer = Pipeline(steps=[('imputer',SimpleImputer(missing_values=np.nan,strategy='constant', fill_value=0))])
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_cols)])
model = XGBClassifier(objective='binary:logistic',n_jobs=29,use_label_encoder=False,random_state = 42)
pipe = Pipeline(steps=[('preprocessor', preprocessor), ('var', VarianceThreshold()), ('sel', selector), ('clf', model)])

I then run a grid search on this pipeline:

gridsearch = GridSearchCV(pipe, param_grid, cv=3, verbose=1,n_jobs=-1)
gridsearch.fit(X_train, y_train)

I take the best estimator as my result:

best_est = gridsearch.best_estimator_

I then carry out calibration:

X_validation_calibrate = pd.DataFrame(best_est[:-1].transform(X_validation),columns=features_cols)
X_test_calibrate = pd.DataFrame(best_est[:-1].transform(X_test),columns=features_cols)

I pass these through the calibration; for example, a snippet is:

sig_clf = CalibratedClassifierCV(best_est['clf'], method="sigmoid", cv="prefit")
iso_clf = CalibratedClassifierCV(best_est['clf'], method="isotonic", cv="prefit")

sig_clf.fit(X_validation_calibrate, y_validation)
iso_clf.fit(X_validation_calibrate, y_validation)
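
The post does not show how the two calibrators were compared. One common option is the Brier score on the held-out test split (lower is better). The following is a minimal sketch under assumptions that are not in the original: synthetic data from `make_classification`, `LogisticRegression` standing in for the XGBoost model, and a hypothetical helper `calibrate_prefit` that uses `FrozenEstimator` where it is available (scikit-learn 1.6 and later) and falls back to `cv="prefit"` on older versions.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split


def calibrate_prefit(fitted_estimator, X_cal, y_cal, method="sigmoid"):
    """Hypothetical helper: calibrate an already fitted estimator without refitting it."""
    try:
        from sklearn.frozen import FrozenEstimator  # scikit-learn >= 1.6
    except ImportError:
        # Older scikit-learn: the classic prefit mode.
        calibrated = CalibratedClassifierCV(fitted_estimator, method=method, cv="prefit")
    else:
        calibrated = CalibratedClassifierCV(FrozenEstimator(fitted_estimator), method=method)
    return calibrated.fit(X_cal, y_cal)


# Synthetic, imbalanced stand-in for the data in the post.
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, stratify=y_train, test_size=0.25, random_state=0
)

# Stand-in for best_est['clf'], fitted on the training split only.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Calibrate on the validation split, score on the untouched test split.
scores = {"uncalibrated": brier_score_loss(y_test, clf.predict_proba(X_test)[:, 1])}
for method in ("sigmoid", "isotonic"):
    calibrated = calibrate_prefit(clf, X_val, y_val, method=method)
    scores[method] = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])

for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Scoring on the test split rather than the validation split keeps the comparison honest, because the calibrators were fitted on the validation data.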

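My sig_clf had the best calibration, so I would like to use it rather than best_est['clf']. Note that sig_clf above wraps only the model, not the preprocessing steps. When I come to make predictions on other datasets, e.g. 'newdata', does the following make sense?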

test1 = best_est[:-1].transform(newdata)
predictions_new = sig_clf.predict_proba(test1)

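Above, I use every preprocessing step of the pipeline (everything except the final classifier) to transform an external dataset called 'newdata', then apply the calibrated sigmoid model to the transformed dataset to get the final calibrated predictions. Is this correct?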
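
For the calibration step itself, the two-stage approach (transform with `best_est[:-1]`, then call the calibrated classifier) should give the same probabilities as calibrating the whole fitted pipeline on the raw validation data, which removes the manual transform. The following is a minimal sketch that checks this under assumptions not in the original: synthetic data, a shortened pipeline, `LogisticRegression` standing in for `XGBClassifier`, and a hypothetical helper `calibrate_prefit` that uses `FrozenEstimator` on scikit-learn 1.6 and later and `cv="prefit"` otherwise.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def calibrate_prefit(fitted_estimator, X_cal, y_cal, method="sigmoid"):
    """Hypothetical helper: calibrate an already fitted estimator without refitting it."""
    try:
        from sklearn.frozen import FrozenEstimator  # scikit-learn >= 1.6
    except ImportError:
        calibrated = CalibratedClassifierCV(fitted_estimator, method=method, cv="prefit")
    else:
        calibrated = CalibratedClassifierCV(FrozenEstimator(fitted_estimator), method=method)
    return calibrated.fit(X_cal, y_cal)


# Synthetic stand-in data with some missing values for the imputer to fill.
X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_new, y_train, _ = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, stratify=y_train, test_size=0.25, random_state=0
)

# Stand-in for gridsearch.best_estimator_.
best_est = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=0)),
    ("var", VarianceThreshold()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_train, y_train)

# Approach from the post: transform manually, calibrate only the classifier.
sig_clf = calibrate_prefit(best_est["clf"], best_est[:-1].transform(X_val), y_val)
proba_manual = sig_clf.predict_proba(best_est[:-1].transform(X_new))

# Alternative: calibrate the whole fitted pipeline on the raw validation data.
sig_pipe = calibrate_prefit(best_est, X_val, y_val)
proba_wrapped = sig_pipe.predict_proba(X_new)

print(np.allclose(proba_manual, proba_wrapped))
```

Wrapping the whole pipeline leaves a single object to persist and call on new data, so the preprocessing cannot be skipped or applied twice by accident.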

