How can we use K-fold cross-validation to predict the target for test samples?
I am trying to learn ML techniques in Python using Spaceship Titanic.
What I am trying to do is perform 3-fold cross-validation and predict the target variable (Transported) using the features from test.csv. The only thing I can do is train a model on my training set, as it contains both my features and my response. What I am trying to do:
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, KFold
from sklearn.neighbors import KNeighborsClassifier
X, y = train_ready.drop('Transported', axis=1), train_ready['Transported']
# 3-Fold Cross-Validation -----
cross_validation = KFold(n_splits=3, random_state=2022, shuffle=True)
classifier = KNeighborsClassifier(n_neighbors=10)
scores = cross_val_score(classifier, X, y, cv=cross_validation)
y_pred = cross_val_predict(classifier, X, y, cv=cross_validation)
y_test_predictions = cross_val_predict(classifier, test_ready, cv=cross_validation)
> TypeError: fit() missing 1 required positional argument: 'y'
And, obviously, I cannot predict my target from the test.csv dataset, as it does not have this column. What is the right algorithm for this task, and what am I doing wrong?
P.S. I would appreciate your patience, as I am new to ML in Python and its syntax; my previous experience was primarily in R.
Comments (1)
You can think of it like this: cross-validation is used to determine the best model and optimize hyperparameters. Once you have settled on a model and its hyperparameters, you train the model one more time on the full training set and make predictions on the unknown data. So when making the final predictions you shouldn't use any cross-validation function; cross_val_predict refits the estimator on every fold, which is why it fails when it is given no y. Instead, call classifier.fit(X, y) and then classifier.predict(test_ready).
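The two-step workflow described here can be sketched as follows. This is a minimal illustration, not the asker's actual pipeline: the data is synthetic (generated with make_classification), and X, y and X_test are stand-ins for the features of train_ready, its Transported column, and test_ready respectively.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins (assumption): X, y play the role of the training
# features and 'Transported'; X_test plays the role of test_ready.
X_all, y_all = make_classification(n_samples=400, n_features=8, random_state=2022)
X, y = X_all[:300], y_all[:300]
X_test = X_all[300:]  # labels deliberately dropped, as in test.csv

cross_validation = KFold(n_splits=3, shuffle=True, random_state=2022)
classifier = KNeighborsClassifier(n_neighbors=10)

# Step 1: cross-validation only estimates how well this configuration
# generalizes. It works on clones, so `classifier` itself stays unfitted.
scores = cross_val_score(classifier, X, y, cv=cross_validation)
print("CV accuracy per fold:", scores)

# Step 2: fit once on the full training set, then predict the unlabeled rows.
classifier.fit(X, y)
y_test_predictions = classifier.predict(X_test)
print("Predictions for the first 10 test rows:", y_test_predictions[:10])
```

With the real data, step 2 would presumably read classifier.fit(X, y) followed by classifier.predict(test_ready), provided test_ready has the same columns in the same order as X.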
You could of course hold out some training data as a sanity check that the model doesn't overfit before making final predictions on the unknown dataset, although the cross-validation should already have convinced you that this won't be the case.
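A hold-out check of this kind might look like the sketch below. It again uses synthetic data as a stand-in for the training set, and the 20% hold-out size and the stratified split are illustrative choices rather than anything the answer prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the labeled training data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=2022)

# Keep 20% of the labeled rows aside; the model never sees them during fit.
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=2022, stratify=y
)

classifier = KNeighborsClassifier(n_neighbors=10)
classifier.fit(X_fit, y_fit)

# A large gap between these two accuracies would hint at overfitting.
train_acc = classifier.score(X_fit, y_fit)
hold_acc = classifier.score(X_hold, y_hold)
print(f"train accuracy: {train_acc:.3f}, hold-out accuracy: {hold_acc:.3f}")
```

If the hold-out accuracy is in line with the cross-validation scores, the model can be refit on all labeled rows before predicting on the unknown dataset.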