How do I use GridSearchCV to predict on X_test for each iteration?
I'm thinking of plotting a graph where the x-axis is the model complexity (e.g., n_neighbors in KNN) and the y-axis is the error (e.g., mean squared error). I'm currently using GridSearchCV, and I realise that .cv_results_ only seems to show the error on the training data.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

KNN = {
    'classifier': [KNeighborsClassifier()],
    'classifier__n_neighbors': [i for i in range(10, 200, 10)],
}
pipeline = Pipeline(steps=[('classifier', KNN['classifier'][0])])
grid_search_knn = GridSearchCV(pipeline, [KNN], n_jobs=-1).fit(x_train, y_train)
grid_search_knn.cv_results_
would give me
'split0_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'split1_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'split2_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'split3_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'split4_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'mean_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'std_test_score': array([9.84e-04, 8.70e-04, 1.30e-03, 1.09e-03, 7.68e-04, 9.61e-04, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16]),
'rank_test_score': array([3, 2, 1, 4, 5, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7])}
Firstly, I don't understand the different kinds of test scores. Are they the training scores? If they are, what metric are they using: accuracy, R², precision, or recall?
Secondly, how would I use model.predict(X_test) for each iteration to find the error on the test dataset, so that I can plot the graph described at the top?
No, these are the validation scores. Each value is an array of 19 numbers, corresponding to the 19 different values for
n_neighbors. You did not specify the cv parameter, so GridSearchCV defaulted to splitting your training set into 5 parts and doing 5 runs, each time using one of those parts as the validation set and the other 4 to train the model. This is what the "split0" to "split4" names refer to. The values are all accuracy scores.
For example, "'split0_test_score': array([0.97, ..." tells you that after training the model with n_neighbors=10 on 80% of the training data, according to the first split, the model classified 97% of the instances in the remaining training data correctly. The mean scores over the 5 splits and the corresponding standard deviations and ranks are also included in the results.
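As a minimal sketch of how these arrays line up (using synthetic make_classification data and a small illustrative candidate list rather than the asker's 19 values of n_neighbors):

```python
# Sketch: cv_results_ holds one entry per candidate value of n_neighbors,
# and one "splitN_test_score" array per CV fold (5 folds by default).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)
grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [5, 10, 15]},  # 3 candidates -> score arrays of length 3
)
grid.fit(X, y)
results = grid.cv_results_

# One mean validation score per candidate, averaged over the 5 splits:
print(len(results["mean_test_score"]))              # 3
# One per-split score array per fold: split0 ... split4
print(sum(k.startswith("split") for k in results))  # 5
```

Each `splitN_test_score[i]` is the accuracy of candidate `i` on validation fold `N`; none of these numbers involve your held-out x_test.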
Note that GridSearchCV has a parameter return_train_score (see its description in the scikit-learn docs). You can set it to True to get the training scores and plot those as one curve, in addition to the validation curve. Scikit-learn even has a validation_curve
function to help with this, and an example of how to use it.

However, note that the plot you show does not mention (cross-)validation at all, and you say you have a separate test set that you want to use for the plot. So instead of doing any cross-validation, a simpler approach is to iterate over the n_neighbors values, fit the model to the entire training set each time, and compute the accuracy scores of that model (e.g. with accuracy_score), one for the training set and one for the test set. This approach is possible in your case because the goal is to produce the plot, and you are not interested in any further hyperparameters apart from n_neighbors.
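That loop might look like the sketch below; synthetic make_classification data and a train_test_split stand in for the asker's x_train/x_test, and "error" is taken to mean 1 − accuracy (the plotting calls are left commented out so the sketch runs headless):

```python
# Sketch: refit KNN per n_neighbors on the full training set, then score
# once on the training set and once on the held-out test set.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

neighbor_values = list(range(10, 200, 10))  # the 19 candidates from the question
train_errors, test_errors = [], []
for k in neighbor_values:
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    # One point per curve per k: error = 1 - accuracy.
    train_errors.append(1 - accuracy_score(y_train, model.predict(x_train)))
    test_errors.append(1 - accuracy_score(y_test, model.predict(x_test)))

# import matplotlib.pyplot as plt
# plt.plot(neighbor_values, train_errors, label="train error")
# plt.plot(neighbor_values, test_errors, label="test error")
# plt.xlabel("n_neighbors"); plt.ylabel("error"); plt.legend(); plt.show()
print(len(train_errors), len(test_errors))  # 19 19
```

If you later do want the cross-validated analogue, validation_curve(estimator, X, y, param_name=..., param_range=..., cv=...) returns the train and validation score arrays for the whole parameter range in one call.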