Cross-validation in cross_val_score

Posted 2025-02-09 11:01:48


When fitting my data in Python I usually do:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This splits my data into two chunks: one for training, the other for testing.

After that I fit the model with:

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

And I can get the accuracy with:

accuracy_score(y_test,y_pred)

I understand these steps.
But what is happening in sklearn.model_selection.cross_val_score? For example:

cross_val_score(estimator=model, X=X_train, y=y_train, cv=10)

Is it doing everything that I did before, but 10 times?

Do I have to split the data into train and test sets myself? From my understanding it splits the data, fits the model, predicts on the test data and gets the accuracy score. 10 times. In one line.

But I don't see how large the train and test sets are. Can I set that manually? Also, are they the same size on each run?
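For reference, the manual steps described above can be sketched end to end; the iris dataset and a logistic-regression model are used here purely as an illustration, not something from the original question:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# One fixed 80/20 split: test_size controls the test-set fraction.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predict() takes only the features; the true labels are used for scoring.
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
```

This is exactly one train/test split with a single accuracy number; cross-validation repeats this idea over several different splits.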


Comments (2)

吃颗糖壮壮胆 2025-02-16 11:01:50


The function "train_test_split" splits the data into a train and a test set randomly, according to a split ratio.

The following "cross_val_score" call, by contrast, does 10-fold cross-validation:

cross_val_score(estimator=model, X=X_train, y=y_train, cv=10)

In this case, the main difference is that the 10-fold CV does not shuffle the data by default, and the folds follow the same order as the original data. You should think critically about whether the order of your data matters for cross-validation; this depends on your specific application.

Choosing which validation method to use: https://stats.stackexchange.com/questions/103459/how-do-i-know-which-method-of-cross-validation-is-best

You can read the docs about K-Fold here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold
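To answer the "can I set it manually?" part: instead of an integer, cv also accepts a cross-validator object, so you can control the splitting yourself. A minimal sketch, again using iris and logistic regression as stand-in data and model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=10: the data is split into ten consecutive folds (no shuffling by default).
scores = cross_val_score(estimator=model, X=X, y=y, cv=10)
print(scores.mean())  # average accuracy over the ten folds

# To control the splitting yourself, pass a cross-validator instead of an int.
# shuffle=True randomizes the fold assignment; random_state makes it repeatable.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores_shuffled = cross_val_score(estimator=model, X=X, y=y, cv=cv)
print(len(scores_shuffled))  # one accuracy score per fold
```

Each element of the returned array is the score from one fold, so averaging them gives a more stable estimate than a single train/test split.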

扬花落满肩 2025-02-16 11:01:50


Based on my understanding, if you set cv=10, it will divide your dataset into 10 folds. So if you have 1000 rows of data, that means 900 rows will be the training dataset and the remaining 100 will be your test dataset on each fold. Hence, you are not required to set any test_size like you did in train_test_split.
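The fold sizes described above can be checked directly; the 1000 synthetic rows below are only an illustration of the 1000-row example in the answer:

```python
import numpy as np
from sklearn.model_selection import KFold

# 1000 synthetic rows, as in the example above.
X = np.arange(1000).reshape(-1, 1)

kf = KFold(n_splits=10)
for train_idx, test_idx in kf.split(X):
    # Every fold: 900 training rows, 100 test rows.
    assert len(train_idx) == 900
    assert len(test_idx) == 100
print("each fold: 900 train rows, 100 test rows")
```

With n_splits=10 the test folds are equal-sized whenever the row count divides evenly; otherwise some folds are one row larger than others.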
