How do I use GridSearchCV to predict on X_test for each iteration?
I'm thinking of plotting a graph where the x-axis is the model complexity (e.g., n_neighbors in KNN) and the y-axis is the error (e.g., mean squared error). I'm currently using GridSearchCV, and I realise that .cv_results_ only seems to show the error on the training data.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

KNN = {
    'classifier': [KNeighborsClassifier()],
    'classifier__n_neighbors': [i for i in range(10, 200, 10)],
}
pipeline = Pipeline(steps=[('classifier', KNN['classifier'][0])])
grid_search_knn = GridSearchCV(pipeline, [KNN], n_jobs=-1).fit(x_train, y_train)
grid_search_knn.cv_results_
would give me
'split0_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'split1_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'split2_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'split3_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'split4_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'mean_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'std_test_score': array([9.84e-04, 8.70e-04, 1.30e-03, 1.09e-03, 7.68e-04, 9.61e-04, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16]),
'rank_test_score': array([3, 2, 1, 4, 5, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7])}
Firstly, I don't understand the different kinds of test scores. Are they the training scores? If they are, what metric are they using: accuracy, R², precision, or recall?
Secondly, how would I use model.predict(X_test) for each iteration to find the error on the test dataset, so that I can plot the graph described at the top?
No, these are the validation scores. Each value is an array of 19 numbers, corresponding to the 19 different values for
n_neighbors. You did not specify the cv parameter, so GridSearchCV defaulted to splitting your training set into 5 parts and doing 5 runs, each time using one of those parts as the validation set and the other 4 to train the model. This is what the "split0" to "split4" names refer to. The values are all accuracy scores.
For example, "'split0_test_score': array([0.97, ..." tells you that after training the model with n_neighbors=10 on 80% of the training data, according to the first split, the model classified 97% of the instances in the remaining training data correctly. The mean scores over the 5 splits and the corresponding standard deviations and ranks are also included in the results.
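As a minimal sketch of how these arrays line up (using synthetic make_classification data and a small illustrative candidate list rather than the asker's 19 values of n_neighbors):

```python
# Sketch: cv_results_ holds one entry per candidate value of n_neighbors,
# and one "splitN_test_score" array per CV fold (5 folds by default).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)
grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [5, 10, 15]},  # 3 candidates -> score arrays of length 3
)
grid.fit(X, y)
results = grid.cv_results_

# One mean validation score per candidate, averaged over the 5 splits:
print(len(results["mean_test_score"]))              # 3
# One per-split score array per fold: split0 ... split4
print(sum(k.startswith("split") for k in results))  # 5
```

Each `splitN_test_score[i]` is the accuracy of candidate `i` on validation fold `N`; none of these numbers involve your held-out x_test.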
Note that GridSearchCV has a parameter return_train_score (see its description in the scikit-learn docs). You can set it to True to get the training scores and plot those as one curve, in addition to the validation curve. Scikit-learn even has a validation_curve
function to help with this, and an example of how to use it.

However, note that the plot you show does not mention (cross-)validation at all, and you say you have a separate test set that you want to use for the plot. So instead of doing any cross-validation, a simpler approach is to iterate over the n_neighbors values, fit the model to the entire training set each time, and compute the accuracy scores of that model (e.g. with accuracy_score), one for the training set and one for the test set. This approach is possible in your case because the goal is to produce the plot, and you are not interested in any further hyperparameters apart from n_neighbors.
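That loop might look like the sketch below; synthetic make_classification data and a train_test_split stand in for the asker's x_train/x_test, and "error" is taken to mean 1 − accuracy (the plotting calls are left commented out so the sketch runs headless):

```python
# Sketch: refit KNN per n_neighbors on the full training set, then score
# once on the training set and once on the held-out test set.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

neighbor_values = list(range(10, 200, 10))  # the 19 candidates from the question
train_errors, test_errors = [], []
for k in neighbor_values:
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    # One point per curve per k: error = 1 - accuracy.
    train_errors.append(1 - accuracy_score(y_train, model.predict(x_train)))
    test_errors.append(1 - accuracy_score(y_test, model.predict(x_test)))

# import matplotlib.pyplot as plt
# plt.plot(neighbor_values, train_errors, label="train error")
# plt.plot(neighbor_values, test_errors, label="test error")
# plt.xlabel("n_neighbors"); plt.ylabel("error"); plt.legend(); plt.show()
print(len(train_errors), len(test_errors))  # 19 19
```

If you later do want the cross-validated analogue, validation_curve(estimator, X, y, param_name=..., param_range=..., cv=...) returns the train and validation score arrays for the whole parameter range in one call.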