Decision tree score always returns 1

Posted on 2025-01-28 21:17:59

I get a score of 100% on my test set when training a decision tree, which seems strange given that I set max_depth = 2. I don't understand what I did wrong. I split the data into training and test sets, but the classifier still scores 1. Here is my code.

This is my dataset, songs.csv. A little about it: it has 400 rows and the class distribution is almost uniform, so I don't understand why the decision tree achieves a perfect score even with max_depth = 2.

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score

df = pd.read_csv('D:Projects/datasets/songs.csv')

X = df.drop(['lyrics', 'song', 'artist'], axis=1)
y = df.artist

le = LabelEncoder()
le.fit(X.genre.unique())
X.genre = le.transform(X.genre)
le.fit(y.unique())
y = pd.Series(le.transform(y))

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.35)
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
accuracy_score(y_test, predictions)

Comments (2)

庆幸我还是我 2025-02-04 21:17:59

Maybe the problem comes from the dataset. I tried RandomForestClassifier and also got an accuracy of 1.0. I then evaluated on separate validation and test sets as well, but the result did not change. This is the text representation of the tree:

>>> text_representation = tree.export_text(clf)
>>> print(text_representation)

|--- feature_1 <= 0.50
|   |--- class: 2
|--- feature_1 >  0.50
|   |--- feature_1 <= 1.50
|   |   |--- class: 1
|   |--- feature_1 >  1.50
|   |   |--- class: 0
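
The dump above names the split column only as feature_1, that is, the second column of X. A minimal sketch of how to get the actual column name printed, using the feature_names parameter of tree.export_text. The data below is made up (songs.csv is not reproduced here), and the idea that a single label-encoded column fully determines the class is an assumption for illustration, not something confirmed about the real dataset.

```python
import pandas as pd
from sklearn import tree

# Hypothetical stand-in for songs.csv: 'genre' (already label-encoded)
# determines the class completely, 'length' is unrelated to the class.
X = pd.DataFrame({
    "length": [210, 185, 240, 200, 230, 195, 220, 205, 215],
    "genre":  [0, 1, 2, 0, 1, 2, 0, 1, 2],
})
y = pd.Series([2, 1, 0, 2, 1, 0, 2, 1, 0])

clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(X, y)

# feature_names replaces the generic feature_<i> labels with column names.
report = tree.export_text(clf, feature_names=list(X.columns))
print(report)
```

If the real data produces a dump that splits on one column only, that column is worth inspecting: it may simply encode the target.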

Here is my full code:

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn import ensemble
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score

df = pd.read_csv('songs.csv')

X = df.drop(['lyrics', 'song', 'artist'], axis=1)
y = df.artist

le = LabelEncoder()
le.fit(X.genre.unique())
X.genre = le.transform(X.genre)
le.fit(y.unique())
y = pd.Series(le.transform(y))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.75 = 0.1875 of the full dataset
    
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2)
clf.fit(X_train, y_train)
predictions = clf.predict(X_val)
print(accuracy_score(y_val, predictions))

predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))

text_representation = tree.export_text(clf)
print(text_representation)

clf = ensemble.RandomForestClassifier(criterion='entropy')
clf.fit(X_train, y_train)
predictions = clf.predict(X_val)
print(accuracy_score(y_val, predictions))

predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))

琉璃繁缕 2025-02-04 21:17:59

I trained a decision-tree model using the code and dataset you provided, and it seems to work as expected. It's not unusual to get a perfect classification score on a small dataset like this one, and the task seems simple enough to be solved perfectly by a tree of height 2. There seems to be nothing wrong with the code.
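
One way to check whether the task really is that simple is to cross-tabulate a candidate feature against the target. If every row of the table has a single non-zero cell, that feature alone determines the class, and a perfect score is expected rather than suspicious. The sketch below uses made-up rows; the column names genre and artist match the question's code, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical rows, not taken from songs.csv.
df = pd.DataFrame({
    "genre":  ["rock", "pop", "rap", "rock", "pop", "rap"],
    "artist": ["A", "B", "C", "A", "B", "C"],
})

# Rows: feature values, columns: classes, cells: number of matching rows.
table = pd.crosstab(df["genre"], df["artist"])
print(table)

# True when each feature value maps to exactly one class.
feature_determines_class = bool(((table > 0).sum(axis=1) == 1).all())
print(feature_determines_class)
```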

You can visualise the resulting tree using tree.plot_tree:

[Image: the plotted decision tree]
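
A minimal sketch of such a plot, in case it helps. The iris data is only a stand-in because songs.csv is not available here, and the output file name tree.png is an arbitrary choice; substitute your own X, y and column names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

# Stand-in dataset; the original post used songs.csv instead.
iris = load_iris()
clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(8, 5))
annotations = tree.plot_tree(
    clf,
    feature_names=list(iris.feature_names),  # label splits with column names
    class_names=list(iris.target_names),     # label leaves with class names
    filled=True,                             # colour nodes by majority class
    ax=ax,
)
fig.savefig("tree.png")  # arbitrary file name
```

In a notebook, plt.show() can be used instead of saving to a file.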

As a sanity check, I also observed that the model's accuracy drops below 70% when the height of the tree is limited to 1, so I think the code you provided is fine.
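
The drop is what one would expect if the data has three roughly balanced classes, as the tree dump in the other comment suggests: a tree of height 1 has only two leaves, so it can predict at most two of the three classes and tops out at about two thirds accuracy. A rough sketch of that comparison on made-up data follows; the columns noise and genre and the class mapping are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up data: three balanced classes fully determined by 'genre';
# 'noise' carries no information about the class.
rng = np.random.default_rng(0)
genre = np.repeat([0, 1, 2], 100)
X = pd.DataFrame({"noise": rng.normal(size=genre.size), "genre": genre})
y = pd.Series(2 - genre)  # arbitrary one-to-one mapping from genre to class

# stratify=y keeps the three classes balanced in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Fit one tree per depth limit and record its test accuracy.
scores = {}
for depth in (1, 2):
    clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    scores[depth] = accuracy_score(y_test, clf.predict(X_test))

print(scores)
```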
