When fitting my data in Python, I usually do:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
This splits my data into two chunks: one for training, the other for testing.
After that, I fit the model and make predictions:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
And I can get the accuracy with:
accuracy_score(y_test, y_pred)
I understand these steps.
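For reference, here is a self-contained version of what I run (the LogisticRegression model and the synthetic data are just placeholders so the snippet is runnable):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# placeholder data and model, only to make the example self-contained
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000)

# hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit on the training set, predict on the held-out test set
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# a single accuracy score for this one split
print(accuracy_score(y_test, y_pred))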
But what is happening in sklearn.model_selection.cross_val_score? For example:
cross_val_score(estimator=model, X=X_train, y=y_train, cv=10)
Is it doing everything that I did before, but 10 times?
Do I still have to split the data into train and test sets? From my understanding, it splits the data, fits the model, predicts on the held-out fold and gets the accuracy score, 10 times, in one line.
But I don't see how large the train and test sets are. Can I set that manually? Also, are they the same size on each run?
2 Answers
The function "train_test_split" splits the train and test set randomly with a split ratio.
While the following "cross_val_score" function does 10-Fold cross-validation.
In this case, the main difference is that the 10-Fold CV does not shuffle the data, and the folds are looped in the same sequence as the original data. You should think critically if the sequence of the data matters for cross-validation, this depends on your specific application.
Choosing which validation method to use: https://stats.stackexchange.com/questions/103459/how-do-i-know-which-method-of-cross-validation-is-best
You can read the docs about K-Fold here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold
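If the order of your data does matter, you can pass a KFold splitter with shuffle=True as the cv argument instead of the plain integer 10. A minimal sketch, assuming a LogisticRegression classifier and synthetic data as placeholders:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# placeholder data and estimator, only for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000)

# cv=10 would use unshuffled folds; a KFold object lets you shuffle first
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(estimator=model, X=X, y=y, cv=cv)

print(scores)         # one accuracy score per fold (10 values)
print(scores.mean())  # average accuracy across the 10 folds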
Based on my understanding, if you set cv=10, it will divide your dataset into 10 folds. So if you have 1000 rows of data, that means 900 rows will be the training dataset and the remaining 100 will be your testing dataset in each of the 10 iterations. Hence, you are not required to set any test_size like you did in train_test_split.
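To see the fold sizes concretely, here is a small sketch (the 1000 dummy rows are made up to match the example above; for classifiers cross_val_score actually uses a stratified variant of this splitter, but the fold sizes work the same way):

import numpy as np
from sklearn.model_selection import KFold

# 1000 dummy rows, matching the 1000-row example above
X = np.arange(1000).reshape(-1, 1)

kf = KFold(n_splits=10)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # every fold: 900 training rows, 100 test rows
    print(f"fold {fold}: train size = {len(train_idx)}, test size = {len(test_idx)}")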