When fitting my data in Python, I usually do:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
This splits my data into two chunks: one for training, the other for testing.
After that, I fit the model and make predictions:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
And I can get the accuracy with:
accuracy_score(y_test, y_pred)
I understand these steps.
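For reference, here is a self-contained version of what I run (the LogisticRegression model and the synthetic data are just placeholders so the snippet is runnable):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# placeholder data and model, only to make the example self-contained
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000)

# hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit on the training set, predict on the held-out test set
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# a single accuracy score for this one split
print(accuracy_score(y_test, y_pred))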
But what is happening in sklearn.model_selection.cross_val_score? For example:
cross_val_score(estimator=model, X=X_train, y=y_train, cv=10)
Is it doing everything that I did before, but 10 times?
Do I still have to split the data into train and test sets? From my understanding, it splits the data, fits the model, predicts on the held-out fold and gets the accuracy score, 10 times, in one line.
But I don't see how large the train and test sets are. Can I set that manually? Also, are they the same size on each run?
2 Answers
The function "train_test_split" splits the train and test set randomly with a split ratio.
While the following "cross_val_score" function does 10-Fold cross-validation.
In this case, the main difference is that the 10-Fold CV does not shuffle the data, and the folds are looped in the same sequence as the original data. You should think critically if the sequence of the data matters for cross-validation, this depends on your specific application.
Choosing which validation method to use: https://stats.stackexchange.com/questions/103459/how-do-i-know-which-method-of-cross-validation-is-best
You can read the docs about K-Fold here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold
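If the order of your data does matter, you can pass a KFold splitter with shuffle=True as the cv argument instead of the plain integer 10. A minimal sketch, assuming a LogisticRegression classifier and synthetic data as placeholders:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# placeholder data and estimator, only for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000)

# cv=10 would use unshuffled folds; a KFold object lets you shuffle first
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(estimator=model, X=X, y=y, cv=cv)

print(scores)         # one accuracy score per fold (10 values)
print(scores.mean())  # average accuracy across the 10 folds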
Based on my understanding, if you set cv=10, it will divide your dataset into 10 folds. So if you have 1000 rows of data, that means 900 rows will be the training dataset and the remaining 100 will be your testing dataset in each of the 10 iterations. Hence, you are not required to set any test_size like you did in train_test_split.
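To see the fold sizes concretely, here is a small sketch (the 1000 dummy rows are made up to match the example above; for classifiers cross_val_score actually uses a stratified variant of this splitter, but the fold sizes work the same way):

import numpy as np
from sklearn.model_selection import KFold

# 1000 dummy rows, matching the 1000-row example above
X = np.arange(1000).reshape(-1, 1)

kf = KFold(n_splits=10)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # every fold: 900 training rows, 100 test rows
    print(f"fold {fold}: train size = {len(train_idx)}, test size = {len(test_idx)}")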