Machine learning model only predicts the mode of the dataset

Posted on 2025-02-09 18:59:47

I am trying to do sentiment analysis for text. I have 909 phrases commonly used in emails, and I scored them out of ten for how angry they are, when isolated.

Now, I upload this .csv file to a Jupyter Notebook, where I import the following modules:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

Now, I name the two columns 'Phrase' and 'Anger':

df=pd.read_csv('Book14.csv', names=['Phrase', 'Anger'])
df_x = df['Phrase']
df_y = df['Anger']

Subsequently, I split this data such that 20% is used for testing and 80% is used for training:

x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)

Now, I convert the words in x_train to numerical data using TfidfVectorizer:

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='en')
x_traincv = tfidfvectorizer.fit_transform(x_train.astype('U'))

Now, I convert x_traincv to an array:

a = x_traincv.toarray()

I also convert x_test to a numerical array:

x_testcv=tfidfvectorizer.fit_transform(x_test)
x_testcv = x_testcv.toarray()

Now, I have

mnb = MultinomialNB()
b=np.array(y_test)
error_score = 0
b=np.array(y_test)
for i in range(len(x_test)):
    mnb.fit(x_testcv,y_test)
    testmessage=x_test.iloc[i]
    predictions = mnb.predict(x_testcv[i].reshape(1,-1))
    error_score = error_score + (predictions-int(b[i]))**2
    print(testmessage)
    print(predictions)
print(error_score/len(x_test))

However, an example of the results I get is:

Bring it back
[0]
It is greatly appreciatd when
[0]
Apologies in advance
[0]
Can you please
[0]
See you then
[0]
I hope this email finds you well.
[0]
Thanks in advance
[0]
I am sorry to inform
[0]
You’re absolutely right
[0]
I am deeply regretful
[0]
Shoot me through
[0]
I’m looking forward to
[0]
As I already stated
[0]
Hello
[0]
We expect all students
[0]
If it’s not too late
[0]

and this repeats on a large scale, even for phrases that are obviously very angry. When I removed all data containing a '0' from the .csv file, the now modal value (a 10) is the only prediction for my sentences.

Why is this happening? Is it some weird way to minimise error? Are there any inherent flaws in my code? Should I take a different approach?


Comments (1)

活雷疯 2025-02-16 18:59:47

Two things. First, you are fitting the MultinomialNB with the test set. In your loop you have mnb.fit(x_testcv,y_test), but you should do mnb.fit(x_traincv,y_train).
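A minimal sketch of what that first point could look like: the classifier is fitted once, on the training split only, and then used to predict the held-out phrases. The phrases and scores below are hypothetical stand-ins for Book14.csv, and the vectorizer is left at its defaults for simplicity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical phrases and anger scores; the real data is in Book14.csv.
phrases = [
    "thanks in advance", "see you then", "hello there",
    "kind regards", "much appreciated", "as I already stated",
    "this is unacceptable", "I am furious about this",
    "stop ignoring my emails", "I already told you twice",
]
scores = [0, 0, 0, 0, 0, 7, 9, 10, 9, 8]

x_train, x_test, y_train, y_test = train_test_split(
    phrases, scores, test_size=0.2, random_state=4
)

# Learn the vocabulary from the training phrases, then reuse it for the test phrases.
vectorizer = TfidfVectorizer(analyzer="word")
x_traincv = vectorizer.fit_transform(x_train)
x_testcv = vectorizer.transform(x_test)

# Fit once, outside any loop, and only on the training data.
mnb = MultinomialNB()
mnb.fit(x_traincv, y_train)

# Predict every test phrase in one call instead of refitting per phrase.
predictions = mnb.predict(x_testcv)
mse = np.mean((predictions - np.array(y_test)) ** 2)

for phrase, predicted in zip(x_test, predictions):
    print(phrase, predicted)
print(mse)
```

With a dataset this small the predictions themselves are not meaningful; the point is only that the model never sees y_test during fitting.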

Second, when pre-processing you should call fit_transform only on the training data; on the test data you should call only the transform method.
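The difference can be seen in the feature matrices themselves. The sketch below uses hypothetical phrases and compares the number of columns produced by transform against a second fit_transform on the test phrases. It also passes stop_words='english', which is the built-in stop-word list name I know scikit-learn accepts; the question's 'en' is, as far as I know, not a recognised value.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in phrases; the real ones come from Book14.csv.
train_phrases = [
    "this is completely unacceptable",
    "thanks for your help",
    "absolutely furious about the delay",
    "kind regards",
]
test_phrases = ["furious delay", "kind help"]

vectorizer = TfidfVectorizer(analyzer="word", stop_words="english")

# fit_transform learns the vocabulary and IDF weights from the training phrases.
x_traincv = vectorizer.fit_transform(train_phrases)

# transform reuses that vocabulary, so the columns line up with x_traincv.
x_testcv = vectorizer.transform(test_phrases)

# Refitting on the test phrases builds a different, smaller vocabulary.
refit = TfidfVectorizer(analyzer="word", stop_words="english").fit_transform(test_phrases)

print(x_traincv.shape[1], x_testcv.shape[1], refit.shape[1])
```

Because the refitted matrix has a different set of columns, a model trained on x_traincv could not be applied to it; using transform keeps each column tied to the same word in both splits.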
