Always getting an accuracy and recall of 1.0, before and after oversampling
I have the wine dataset; I removed the nulls and normalized the data. Then I created a new label column: if the (scaled) quality score is above 0.7 the wine is good, otherwise it is bad. I just wanted to try binary classification.
I tried Logistic Regression on the imbalanced dataset and got this:
[[418 0]
[ 0 60]]
precision recall f1-score support
0 1.00 1.00 1.00 418
1 1.00 1.00 1.00 60
accuracy 1.00 478
macro avg 1.00 1.00 1.00 478
weighted avg 1.00 1.00 1.00 478
So I resampled with SMOTE and tried a RandomForestClassifier, and got this:
Accuracy = 1.00
Recall = 1.00
I really doubt this is possible.
What am I doing wrong?
Full code below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing

# drop rows with any missing values, then confirm none remain
df.dropna(how='any', inplace=True)
df.isnull().sum()
#normalize
scaler = preprocessing.MinMaxScaler()
names = df.columns
d = scaler.fit_transform(df)
scaled_df = pd.DataFrame(d, columns=names)
scaled_df.head()
# Count unique values for the quality score.
scaled_df['quality'].value_counts()
Output:
0.4 679
0.6 636
0.8 197
0.2 53
1.0 18
0.0 10
Name: quality, dtype: int64
# add a binary label column
conditions = [
    (scaled_df['quality'] <= 0.7),
    (scaled_df['quality'] > 0.7)
]
values = [0, 1]
scaled_df['QualityLabel'] = np.select(conditions, values)
scaled_df
# We can use value counts
scaled_df['QualityLabel'].value_counts()
# or we can separate the classes and then print the shape
class_0 = scaled_df[scaled_df['QualityLabel'] == 0]
class_1 = scaled_df[scaled_df['QualityLabel'] == 1]
# print the shape of each class
print('class 0:', class_0.shape)
print('class 1:', class_1.shape)
class 0: (1378, 13)
class 1: (215, 13)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(scaled_df, scaled_df["QualityLabel"], test_size=0.3, random_state=0)
np.bincount(y_train)
# two classes in the training split: 960 samples of class 0 and 155 of class 1
y_train.head(10)
from sklearn.linear_model import LogisticRegression

# Initialize the classifier
clf = LogisticRegression(random_state=0)
# Fit on the training data
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# let's resample with SMOTE
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_sm, y_sm = sm.fit_resample(scaled_df, scaled_df["QualityLabel"])
print(f'''Shape of X before SMOTE: {scaled_df.shape}
Shape of X after SMOTE: {X_sm.shape}''')
print('\nBalance of positive and negative classes (%):')
y_sm.value_counts(normalize=True) * 100
Output:
Shape of X before SMOTE: (1593, 13)
Shape of X after SMOTE: (2756, 13)
Balance of positive and negative classes (%):
0 50.0
1 50.0
Name: QualityLabel, dtype: float64
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix
X_train, X_test, y_train, y_test = train_test_split(
X_sm, y_sm, test_size=0.25, random_state=42
)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(f'Accuracy = {accuracy_score(y_test, preds):.2f}\nRecall = {recall_score(y_test, preds):.2f}\n')
cm = confusion_matrix(y_test, preds)
plt.figure(figsize=(8, 6))
plt.title('Confusion Matrix (with SMOTE)', size=16)
sns.heatmap(cm, annot=True, cmap='Blues');
Comments (1)
Looking at your code, you did not drop the label columns quality and QualityLabel from your features. If they are still among the features, the model predicts the label with 100% accuracy, because the label itself is one of its inputs. Let's remove the columns that are your labels:
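For example, a minimal sketch reusing the variable names from the question (the stratify argument is an addition, not from the original code; it keeps the class ratio equal in both splits):

from sklearn.model_selection import train_test_split

# keep every feature column, but drop the raw score and the label derived from it
X = scaled_df.drop(columns=['quality', 'QualityLabel'])
y = scaled_df['QualityLabel']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)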
Fit the model:
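For instance, the same LogisticRegression setup as in the question (max_iter is raised here only as a precaution against convergence warnings):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)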
And you can see the confusion matrix makes more sense now:
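Evaluating with the question's own metric calls; with the leaked columns removed, you should see non-zero off-diagonal counts and an accuracy well below 1.00:

from sklearn.metrics import confusion_matrix, classification_report

# the matrix now reflects genuine prediction errors instead of a copied label
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))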