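Applying cross-validation to Naive Bayes
I split the dataset into 60% training, 20% test, and 20% validation data.
Splitting the data into training, test, and validation sets: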
from sklearn.model_selection import train_test_split
data['label'] = (data['label'].replace({'ham' : 0,
'spam' : 1}))
X_train, X_test, y_train, y_test = train_test_split(data['message'],
data['label'], test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.8 = 0.2
print('Total: {} rows'.format(data.shape[0]))
print('Train: {} rows'.format(X_train.shape[0]))
print(' Test: {} rows'.format(X_test.shape[0]))
print(' Validation: {} rows'.format(X_val.shape[0]))
MultinomialNB from sklearn:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import numpy as np
naive_bayes = MultinomialNB().fit(train_data,
y_train)
predictions = naive_bayes.predict(test_data)
Evaluating the model:
from sklearn.metrics import (accuracy_score,
precision_score,
recall_score,
f1_score)
accuracy_score = accuracy_score(y_test,
predictions)
precision_score = precision_score(y_test,
predictions)
recall_score = recall_score(y_test,
predictions)
f1_score = f1_score(y_test,
predictions)
My problem is with the validation step. The error says:
warnings.warn("Estimator fit failed. The score on this train-test"
This is how I wrote the validation. I don't know whether I'm doing it correctly:
from sklearn.model_selection import cross_val_score
mnb = MultinomialNB()
scores = cross_val_score(mnb,X_val,y_val, cv = 10, scoring='accuracy')
print('Cross-validation scores:{}'.format(scores))
Comments (2)
First, it is worth noting that the name "cross-validation" does not mean you have to use a separate validation set, as you have done in your code, to do the cross-validation. There are a number of reasons why you would perform cross-validation, which include:
Hence, your case here leans toward the first use case. As such, you don't need to first perform a split into train, val, and test. Instead, you can perform the 10-fold cross-validation on your entire dataset. If you are doing hyperparameter tuning, then you can keep a hold-out set of, say, 30% and use the remaining 70% for cross-validation. Once the best parameters have been determined, you can then use the hold-out set to evaluate the model with the best parameters.
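The two approaches described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy `messages` and `labels` lists stand in for `data['message']` and `data['label']`, the `alpha` grid is an arbitrary choice, and the `CountVectorizer` step is an assumption about how the text is turned into features. The question passes raw text (`X_val`) to `MultinomialNB`, which may be what triggers the "Estimator fit failed" warning, though the full traceback is not shown, so that is a guess.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for data['message'] and data['label'].
messages = (["win a free prize now {}".format(i) for i in range(20)]
            + ["see you at lunch tomorrow {}".format(i) for i in range(20)])
labels = [1] * 20 + [0] * 20  # 1 = spam, 0 = ham

# Vectorizing inside the pipeline means each fold builds its vocabulary
# from its own training portion only.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Approach 1: 10-fold cross-validation on the entire dataset.
scores = cross_val_score(model, messages, labels, cv=10, scoring="accuracy")
print("Cross-validation scores: {}".format(scores))

# Approach 2: keep a 30% hold-out set and tune on the remaining 70%.
X_cv, X_hold, y_cv, y_hold = train_test_split(
    messages, labels, test_size=0.3, random_state=1, stratify=labels)

# The alpha values below are an arbitrary illustrative grid.
search = GridSearchCV(model, {"multinomialnb__alpha": [0.1, 0.5, 1.0]},
                      cv=10, scoring="accuracy")
search.fit(X_cv, y_cv)

# Evaluate the refitted best model once on the untouched hold-out set.
holdout_accuracy = search.score(X_hold, y_hold)
print("Best parameters: {}".format(search.best_params_))
print("Hold-out accuracy: {}".format(holdout_accuracy))
```

With real data, `messages` and `labels` would be replaced by the message and label columns, and the hold-out score should be computed only once, after tuning is finished.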
Some refs:
https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79
https://www.analyticsvidhya.com/blog/2021/11/top-7-cross-validation-techniques-with-python-code/
https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
I did not get any error or warning. Maybe it works as written.
Result: