Machine learning model only predicts the mode of the dataset

Posted on 2025-02-09 18:59:47

I am trying to do sentiment analysis on text. I have 909 phrases commonly used in emails, and I scored each one out of ten for how angry it sounds in isolation.

I upload this .csv file to a Jupyter Notebook, where I import the following modules:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

Now, I name the two columns 'Phrase' and 'Anger':

df=pd.read_csv('Book14.csv', names=['Phrase', 'Anger'])
df_x = df['Phrase']
df_y = df['Anger']

Subsequently, I split this data such that 20% is used for testing and 80% is used for training:

x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)

Now, I convert the words in x_train to numerical data using TfidfVectorizer:

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='en')
x_traincv = tfidfvectorizer.fit_transform(x_train.astype('U'))

Now, I convert x_traincv to an array:

a = x_traincv.toarray()

I also vectorize x_test and convert the result, x_testcv, to a numerical array:

x_testcv=tfidfvectorizer.fit_transform(x_test)
x_testcv = x_testcv.toarray()
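
One thing that may be worth checking at this step: calling fit_transform a second time on x_test relearns the vocabulary from the test phrases, so the test columns no longer correspond to the training columns. The usual pattern is to fit the vectorizer on the training text only and call transform on the test text. The sketch below uses made-up phrases rather than the real CSV, and it passes stop_words='english', which is the built-in stop list name scikit-learn accepts as a string.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical phrases standing in for the Phrase column of the CSV.
train_phrases = ["thanks in advance", "as I already stated", "see you then"]
test_phrases = ["thanks for the update", "as stated before"]

vectorizer = TfidfVectorizer(analyzer="word", stop_words="english")

# Learn the vocabulary and IDF weights from the training text only.
x_train_tfidf = vectorizer.fit_transform(train_phrases)

# Reuse that vocabulary for the test text; words not seen in training are ignored.
x_test_tfidf = vectorizer.transform(test_phrases)

# Both matrices now share the same columns (one per training vocabulary word).
print(x_train_tfidf.shape, x_test_tfidf.shape)
```

With this arrangement a model fitted on the training matrix can be applied to the test matrix directly, because column j means the same word in both.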

Now, I fit the model and score its predictions as follows:

mnb = MultinomialNB()
b=np.array(y_test)
error_score = 0
b=np.array(y_test)
for i in range(len(x_test)):
    mnb.fit(x_testcv,y_test)
    testmessage=x_test.iloc[i]
    predictions = mnb.predict(x_testcv[i].reshape(1,-1))
    error_score = error_score + (predictions-int(b[i]))**2
    print(testmessage)
    print(predictions)
print(error_score/len(x_test))
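
Note that the loop above fits the model on the test features and test labels and then predicts those same rows, so the printed score does not measure how well the model generalises. A minimal sketch of the more usual arrangement, assuming invented phrases and scores in place of the real data, fits once on the training split and predicts the whole test split in a single call. mean_squared_error from sklearn.metrics is swapped in here for the manual squared-error loop.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.naive_bayes import MultinomialNB

# Made-up phrases and anger scores (0-10); these are not the asker's data.
train_phrases = ["thanks in advance", "see you then", "as I already stated",
                 "this is unacceptable", "hope you are well", "stop ignoring me"]
train_scores = np.array([0, 0, 7, 10, 0, 10])
test_phrases = ["thanks, see you", "this is unacceptable behaviour"]
test_scores = np.array([0, 10])

# Fit the vectorizer on the training text, then reuse it for the test text.
vectorizer = TfidfVectorizer(analyzer="word")
x_train_tfidf = vectorizer.fit_transform(train_phrases)
x_test_tfidf = vectorizer.transform(test_phrases)

# Fit the classifier once, on the training split only.
mnb = MultinomialNB()
mnb.fit(x_train_tfidf, train_scores)

# Predict every test row in one call and score against the held-out labels.
predictions = mnb.predict(x_test_tfidf)
mse = mean_squared_error(test_scores, predictions)

for phrase, predicted in zip(test_phrases, predictions):
    print(phrase, predicted)
print(mse)
```

MultinomialNB accepts the sparse matrix returned by the vectorizer, so the toarray() conversion is not needed for fitting or predicting.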

However, here is an example of the results I get:

Bring it back
[0]
It is greatly appreciatd when
[0]
Apologies in advance
[0]
Can you please
[0]
See you then
[0]
I hope this email finds you well.
[0]
Thanks in advance
[0]
I am sorry to inform
[0]
You’re absolutely right
[0]
I am deeply regretful
[0]
Shoot me through
[0]
I’m looking forward to
[0]
As I already stated
[0]
Hello
[0]
We expect all students
[0]
If it’s not too late
[0]

This repeats on a large scale, even for phrases that are obviously very angry. When I removed every row scored '0' from the .csv file, the new modal value (10) became the only prediction for my sentences.
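
A classifier that always returns the most common score can be a sign that the label column is heavily imbalanced, so the class prior outweighs the handful of word features a short phrase provides. One quick check, sketched here with invented scores rather than the real Anger column, is to count how often each score occurs and see what share of rows the majority score covers.

```python
import pandas as pd

# Invented anger scores standing in for the Anger column of the CSV.
anger = pd.Series([0, 0, 0, 0, 0, 0, 3, 5, 10, 10], name="Anger")

# Number of rows per score, most frequent first.
counts = anger.value_counts()

# The modal score and the fraction of rows that carry it.
majority_label = counts.idxmax()
majority_share = counts.max() / len(anger)

print(counts)
print(majority_label, majority_share)
```

If the majority share is very high, always predicting that score already gives a low error, which is a useful baseline to compare any trained model against.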

Why is this happening? Is it some weird way to minimise error? Are there any inherent flaws in my code? Should I take a different approach?
