Always getting an accuracy and recall of 1.0, before and after oversampling
I have the wine dataset; I removed the nulls and normalized the data. Then I created a new label column: if the (scaled) quality score is above 0.7 the wine is good, otherwise it is bad. I just wanted to try binary classification.
I tried Logistic Regression on the imbalanced dataset and got this:
[[418 0]
[ 0 60]]
precision recall f1-score support
0 1.00 1.00 1.00 418
1 1.00 1.00 1.00 60
accuracy 1.00 478
macro avg 1.00 1.00 1.00 478
weighted avg 1.00 1.00 1.00 478
So I resampled with SMOTE and tried a RandomForestClassifier, and got this:
Accuracy = 1.00
Recall = 1.00
I really doubt this is possible.
What am I doing wrong?
Full code below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing

# drop rows with any missing values, then confirm none remain
df.dropna(how='any', inplace=True)
df.isnull().sum()
#normalize
scaler = preprocessing.MinMaxScaler()
names = df.columns
d = scaler.fit_transform(df)
scaled_df = pd.DataFrame(d, columns=names)
scaled_df.head()
# Count unique values for the quality score.
scaled_df['quality'].value_counts()
Output:
0.4 679
0.6 636
0.8 197
0.2 53
1.0 18
0.0 10
Name: quality, dtype: int64
# add a binary label column
conditions = [
    (scaled_df['quality'] <= 0.7),
    (scaled_df['quality'] > 0.7)
]
values = [0, 1]
scaled_df['QualityLabel'] = np.select(conditions, values)
scaled_df
# We can use value counts
scaled_df['QualityLabel'].value_counts()
# or we can separate the classes and then print the shape
class_0 = scaled_df[scaled_df['QualityLabel'] == 0]
class_1 = scaled_df[scaled_df['QualityLabel'] == 1]
# print the shape of each class
print('class 0:', class_0.shape)
print('class 1:', class_1.shape)
class 0: (1378, 13)
class 1: (215, 13)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(scaled_df, scaled_df["QualityLabel"], test_size=0.3, random_state=0)
np.bincount(y_train)
# two classes in the training split: 960 samples of class 0 and 155 of class 1
y_train.head(10)
from sklearn.linear_model import LogisticRegression

# Initialize the classifier
clf = LogisticRegression(random_state=0)
# Fit on the training data
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# let's resample with SMOTE
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_sm, y_sm = sm.fit_resample(scaled_df, scaled_df["QualityLabel"])
print(f'''Shape of X before SMOTE: {scaled_df.shape}
Shape of X after SMOTE: {X_sm.shape}''')
print('\nBalance of positive and negative classes (%):')
y_sm.value_counts(normalize=True) * 100
Output:
Shape of X before SMOTE: (1593, 13)
Shape of X after SMOTE: (2756, 13)
Balance of positive and negative classes (%):
0 50.0
1 50.0
Name: QualityLabel, dtype: float64
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix
X_train, X_test, y_train, y_test = train_test_split(
X_sm, y_sm, test_size=0.25, random_state=42
)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(f'Accuracy = {accuracy_score(y_test, preds):.2f}\nRecall = {recall_score(y_test, preds):.2f}\n')
cm = confusion_matrix(y_test, preds)
plt.figure(figsize=(8, 6))
plt.title('Confusion Matrix (with SMOTE)', size=16)
sns.heatmap(cm, annot=True, cmap='Blues');
Comments (1)
Looking at your code, you did not drop the label columns quality and QualityLabel from your features. If they are still among the features, the model predicts the label with 100% accuracy, because the label itself is one of its inputs. Let's remove the columns that are your labels:
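For example, a minimal sketch reusing the variable names from the question (the stratify argument is an addition, not from the original code; it keeps the class ratio equal in both splits):

from sklearn.model_selection import train_test_split

# keep every feature column, but drop the raw score and the label derived from it
X = scaled_df.drop(columns=['quality', 'QualityLabel'])
y = scaled_df['QualityLabel']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)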
Fit the model:
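For instance, the same LogisticRegression setup as in the question (max_iter is raised here only as a precaution against convergence warnings):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)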
And you can see the confusion matrix makes more sense now:
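Evaluating with the question's own metric calls; with the leaked columns removed, you should see non-zero off-diagonal counts and an accuracy well below 1.00:

from sklearn.metrics import confusion_matrix, classification_report

# the matrix now reflects genuine prediction errors instead of a copied label
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))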