Machine learning model only predicts the mode of the dataset
I am trying to do sentiment analysis on text. I have 909 phrases commonly used in emails, and I scored each of them, in isolation, out of ten for how angry it is.
Now, I upload this .csv file to a Jupyter Notebook, where I import the following modules:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
Now, I name the two columns 'Phrase' and 'Anger':
df=pd.read_csv('Book14.csv', names=['Phrase', 'Anger'])
df_x = df['Phrase']
df_y = df['Anger']
Subsequently, I split this data such that 20% is used for testing and 80% is used for training:
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)
Now, I convert the words in x_train to numerical data using TfidfVectorizer:
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='en')
x_traincv = tfidfvectorizer.fit_transform(x_train.astype('U'))
Now, I convert x_traincv to an array:
a = x_traincv.toarray()
I also vectorize x_test and convert the result, x_testcv, to a numerical array:
x_testcv=tfidfvectorizer.fit_transform(x_test)
x_testcv = x_testcv.toarray()
Now, I have
mnb = MultinomialNB()
b=np.array(y_test)
error_score = 0
b=np.array(y_test)
for i in range(len(x_test)):
    mnb.fit(x_testcv,y_test)
    testmessage=x_test.iloc[i]
    predictions = mnb.predict(x_testcv[i].reshape(1,-1))
    error_score = error_score + (predictions-int(b[i]))**2
    print(testmessage)
    print(predictions)
print(error_score/len(x_test))
However, an example of the results I get is:
Bring it back
[0]
It is greatly appreciatd when
[0]
Apologies in advance
[0]
Can you please
[0]
See you then
[0]
I hope this email finds you well.
[0]
Thanks in advance
[0]
I am sorry to inform
[0]
You’re absolutely right
[0]
I am deeply regretful
[0]
Shoot me through
[0]
I’m looking forward to
[0]
As I already stated
[0]
Hello
[0]
We expect all students
[0]
If it’s not too late
[0]
and this repeats on a large scale, even for phrases that are obviously very angry. When I removed all rows scored '0' from the .csv file, the new modal value (a 10) became the only prediction for my sentences.
Why is this happening? Is it some weird way to minimise error? Are there any inherent flaws in my code? Should I take a different approach?
Comments (1)
Two things. First, you are fitting the MultinomialNB with the test set. In your loop you have
mnb.fit(x_testcv,y_test)
but you should do
mnb.fit(x_traincv,y_train)
Second, when pre-processing you should call fit_transform only on the training data; on the test data you should call only the transform method. Calling fit_transform again on the test set learns a new vocabulary, so the test columns no longer correspond to the features the model was trained on.
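As a rough sketch of both fixes together, something like the following might work. It is not the asker's exact notebook: the helper name train_and_score and the toy phrases and scores are hypothetical stand-ins for Book14.csv, and it passes stop_words='english' (the built-in stop list name in scikit-learn) rather than 'en'. The model is also fitted once, outside any loop.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def train_and_score(phrases, scores, test_size=0.2, random_state=4):
    """Fit on the training split only, then score the held-out split.

    Hypothetical helper for illustration; not part of the original post.
    """
    x_train, x_test, y_train, y_test = train_test_split(
        phrases, scores, test_size=test_size, random_state=random_state
    )

    # Learn the vocabulary and IDF weights from the training phrases only.
    vectorizer = TfidfVectorizer(analyzer="word", stop_words="english")
    x_traincv = vectorizer.fit_transform(x_train)

    # Reuse the fitted vocabulary so test columns match the training columns.
    x_testcv = vectorizer.transform(x_test)

    # Fit once on the training data, never on the test data.
    mnb = MultinomialNB()
    mnb.fit(x_traincv, y_train)

    predictions = mnb.predict(x_testcv)
    mse = float(np.mean((predictions - np.asarray(y_test)) ** 2))
    return x_test, predictions, mse


# Toy data standing in for Book14.csv (assumed, not the asker's data).
phrases = [
    "Thanks for the quick update",
    "This delay is completely unacceptable",
    "Looking forward to our meeting",
    "Stop ignoring my emails",
    "Hope your week is going well",
    "Fix this mistake immediately",
    "Many thanks for your help",
    "I demand a refund today",
    "Great work on the report",
    "Your service is terrible",
]
scores = [0, 10, 0, 10, 0, 10, 0, 10, 0, 10]

x_test, predictions, mse = train_and_score(phrases, scores)
for phrase, pred in zip(x_test, predictions):
    print(phrase, pred)
print(mse)
```

With only a handful of rows the predictions themselves mean little; the point is the order of operations: fit_transform on the training split, transform on the test split, and a single fit on the training data.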