如何使用“ Scikit-Learn校准”微调LightGBM之后
我对LGBM进行了微调并进行了校准,但是在应用校准方面遇到了麻烦。
我有1)火车,2)有效,3)测试数据。
我使用1)火车数据和2)有效数据训练和微调的LGBM。 然后,我得到了LGBM的最佳参数。
之后,我想校准,以使模型的输出可以直接解释为置信度。但是我很困惑使用Scikit-Learn 校准ClassifierCV 。
在我的情况下,我应该使用cv ='prefit'还是cv = 5?另外,我是否应该使用火车数据或有效的数据拟合校准校准Classifiercv?
1)未校准的_clf,但是在训练之后
clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=True, early_stopping_rounds=20)
2-1
cal_clf = CalibratedClassifierCV(clf, cv='prefit', method='isotonic')
cal_clf.fit(X_valid, y_valid)
)
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_train, y_train)
2-3)Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_valid, y_valid)
哪一个是对的?一切都是对的,或者只有一个或两个是对的?
以下是代码。
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.calibration import calibration_curve
from sklearn.calibration import CalibratedClassifierCV
import lightgbm as lgb
import matplotlib.pyplot as plt
np.random.seed(0)
n_samples = 10000
X, y = make_classification(
n_samples=3*n_samples, n_features=20, n_informative=2,
n_classes=2, n_redundant=2, random_state=32)
#n_samples = N_SAMPLES//10
X_train, y_train = X[:n_samples], y[:n_samples]
X_valid, y_valid = X[n_samples:2*n_samples], y[n_samples:2*n_samples]
X_test, y_test = X[2*n_samples:], y[2*n_samples:]
plt.figure(figsize=(12, 9))
plt.plot([0, 1], [0, 1], '--', color='gray')
# 1) Uncalibrated_clf but fine-tuned on training data
clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=True, early_stopping_rounds=20)
y_prob = clf.predict_proba(X_test)[:, 1]
fraction_of_positives, mean_predicted_value = calibration_curve(y_test, y_prob, n_bins=10)
plt.plot(
fraction_of_positives,
mean_predicted_value,
'o-', label='uncalibrated_clf')
# 2-1) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv='prefit', method='isotonic')
cal_clf.fit(X_valid, y_valid)
y_prob1 = cal_clf.predict_proba(X_test)[:, 1]
fraction_of_positives1, mean_predicted_value1 = calibration_curve(y_test, y_prob1, n_bins=10)
plt.plot(
fraction_of_positives1,
mean_predicted_value1,
'o-', label='calibrated_clf1')
# 2-2) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_train, y_train)
y_prob2 = cal_clf.predict_proba(X_test)[:, 1]
fraction_of_positives2, mean_predicted_value2 = calibration_curve(y_test, y_prob2, n_bins=10)
plt.plot(
fraction_of_positives2,
mean_predicted_value2,
'o-', label='calibrated_clf2')
plt.legend()
# 2-3) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_valid, y_valid)
y_prob3 = cal_clf.predict_proba(X_test)[:, 1]
fraction_of_positives3, mean_predicted_value3 = calibration_curve(y_test, y_prob3, n_bins=10)
plt.plot(
fraction_of_positives2,
mean_predicted_value2,
'o-', label='calibrated_clf3')
plt.legend()
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
解决此问题的方法是:
a)拟合模型并校准固定集合
(这实际上是此处的预处理的含义:该模型已经拟合了,现在已拟合新的相关集合并校准输出)。
b)拟合模型并在训练集上对训练集进行校准,
通常是交叉验证的数量和方法(在Scikit-Learn的术语中与PLATT缩放或Sigmoid的方法)严格取决于您的数据和设置。因此,我建议将它们放在网格搜索中,看看是什么产生了最佳结果。
最后,可以在这里找到更深入的潜水:
The way to go about this is:
a) fit the model and calibrate on the hold out set
(this is actually the meaning of prefit here: the model is already fitted now take a new relevant set and calibrate the output).
b) fit the model and calibrate with cross validation on the training set
As is usually the case the number of cross validations and the method (isotonic regression vs Platt scaling or sigmoid in scikit-learn's jargon) critically depends on your data and your setup. Therefore, I'd suggest to put those in a grid search and see what produces the best results.
Finally, a deeper dive can be found here:
https://machinelearningmastery.com/calibrated-classification-model-in-scikit-learn/