Decision tree score always returns 1

Posted on 2025-01-28 21:17:59

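I get a 100% score on my test set after training a decision tree, which seems strange given that I set max_depth = 2. I don't understand what I did wrong. I split the data into train and test sets, but the classifier still scores 1.

This is my dataset, songs.csv. A little about it: it has 400 rows and the class distribution is almost uniform, so I don't understand why the decision tree gives a perfect score even with max_depth = 2. Here is my code.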

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score

df = pd.read_csv('D:Projects/datasets/songs.csv')

X = df.drop(['lyrics', 'song', 'artist'], axis=1)
y = df.artist

le = LabelEncoder()
le.fit(X.genre.unique())
X.genre = le.transform(X.genre)
le.fit(y.unique())
y = pd.Series(le.transform(y))

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.35)
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
accuracy_score(y_test, predictions)

庆幸我还是我 2025-02-04 21:17:59

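Maybe the problem comes from the dataset. I tried RandomForestClassifier and also got an accuracy of 1.0. I also evaluated on both a validation set and a test set, but the result did not change. This is the text representation of the tree: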

>>> text_representation = tree.export_text(clf)
>>> print(text_representation)

|--- feature_1 <= 0.50
|   |--- class: 2
|--- feature_1 >  0.50
|   |--- feature_1 <= 1.50
|   |   |--- class: 1
|   |--- feature_1 >  1.50
|   |   |--- class: 0
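
The tree above splits only on feature_1, and its three value ranges map one-to-one onto the three classes, which would explain a perfect score at max_depth=2. Below is a minimal sketch of how one might confirm which column that is. It uses a small made-up DataFrame in place of songs.csv, so the `year` column and all values are assumptions for illustration; only the `genre` and `artist` column names come from the question.

```python
import pandas as pd
from sklearn import tree

# Hypothetical stand-in for songs.csv: every artist has exactly one genre.
df = pd.DataFrame({
    "year":   [1999, 2005, 2010, 2001, 2015, 2008, 1995, 2012, 2003],
    "genre":  ["rock", "rock", "rock", "pop", "pop", "pop", "rap", "rap", "rap"],
    "artist": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
})

X = df.drop(columns=["artist"]).copy()
X["genre"] = X["genre"].astype("category").cat.codes  # integer-encode genre
y = df["artist"]

clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(X, y)

# Passing feature_names makes export_text print column names
# instead of positional labels such as "feature_1".
report = tree.export_text(clf, feature_names=list(X.columns))
print(report)

# Cross-tabulate the suspect column against the target. A single
# non-zero cell in every row means the column determines the label.
table = pd.crosstab(df["genre"], df["artist"])
print(table)

# Same check as a number: how many distinct artists appear per genre.
artists_per_genre = df.groupby("genre")["artist"].nunique()
print(artists_per_genre)
```

If the crosstab on the real data looks like this one, the perfect score probably reflects the data rather than a bug in the split.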

Here is my full code:

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn import ensemble
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score

df = pd.read_csv('songs.csv')

X = df.drop(['lyrics', 'song', 'artist'], axis=1)
y = df.artist

le = LabelEncoder()
le.fit(X.genre.unique())
X.genre = le.transform(X.genre)
le.fit(y.unique())
y = pd.Series(le.transform(y))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.75 = 0.1875 of the full data
    
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2)
clf.fit(X_train, y_train)
predictions = clf.predict(X_val)
print(accuracy_score(y_val, predictions))

predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))

text_representation = tree.export_text(clf)
print(text_representation)

clf = ensemble.RandomForestClassifier(criterion='entropy')
clf.fit(X_train, y_train)
predictions = clf.predict(X_val)
print(accuracy_score(y_val, predictions))

predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))
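
If a single column turns out to determine the label, one way to get a more honest estimate is to drop that column and compare the result with a majority-class baseline. The following is only a sketch on synthetic data: the `leak` and `noise` columns are hypothetical and stand in for whatever the real columns of songs.csv are.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic three-class target standing in for the encoded artist.
y = pd.Series(rng.integers(0, 3, size=n), name="artist")

X = pd.DataFrame({
    "noise": rng.normal(size=n),  # unrelated to the label
    "leak": y.to_numpy(),         # exact copy of the label (the suspect column)
})


def mean_cv_accuracy(model, features):
    """Mean accuracy over 5 stratified cross-validation folds."""
    return cross_val_score(model, features, y, cv=5).mean()


tree_clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)

with_leak = mean_cv_accuracy(tree_clf, X)
without_leak = mean_cv_accuracy(tree_clf, X.drop(columns=["leak"]))
baseline = mean_cv_accuracy(DummyClassifier(strategy="most_frequent"), X)

print(f"with leak:    {with_leak:.3f}")
print(f"without leak: {without_leak:.3f}")
print(f"baseline:     {baseline:.3f}")
```

A score that stays perfect with the column and falls toward the baseline without it suggests the model is reading the answer off that one feature.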
