sklearn 中的决策树：序数数据仍然是一个严重的问题

发布于 2025-01-14 04:06:53 字数 1509 浏览 3 评论 0原文

我想展示一个示例并在这里寻求解决方案。这里有很多与决策树相关的查询，以及关于选择序数数据与分类数据等。我的示例以下面的代码形式给出：

from sklearn import tree
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

c1=pd.Series([0,1,2,2,2,0,1,2,0,1,2])
c2=pd.Series([0,1,1,2,0,1,0,0,2,1,1])
c3=pd.Series([0,1,1,2,0,1,1,2,0,2,2])
c4=pd.Series([0,1,2,0,0,2,2,1,2,0,1])# My encoding : Veg:0, Glut:1, None:2
labels=pd.Series([0,0,0,0,1,1,1,0,0,1,1])

dnl=pd.concat([c1,c2,c3,c4],axis=1)
d=dnl.to_numpy()

clf = tree.DecisionTreeClassifier(criterion="entropy",random_state=420,max_depth=2,splitter='best')
clf_tree = clf.fit(d, labels.to_numpy())
print(clf_tree)

score=clf_tree.score(d,labels.to_numpy())
error=1-score
print("The error= ",error)

from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(6, 6)) #figsize value changes the size of plot
plot_tree(clf_tree,ax=ax)
plt.show()


from sklearn.metrics import confusion_matrix
yp=clf_tree.predict(dnl)
print(yp)
print(labels.to_numpy())
cm = confusion_matrix(labels, yp)
print("The confusion matrix= ",cm)

结果：

更改 c4 编码（交换1s和0s) 以下更改树！误分类错误较小！ c4=pd.Series([1,0,2,1,1,2,2,0,2,1,0])#修改后的编码：Veg:1, Glut:0,None:2

为什么决策树无法选择中间值作为条件？

原文

I wanted to showcase a sample example and ask for a solution here. There are a lot of decision tree related queries here, and about choosing ordinal versus categorical data, etc. My example is given as a code below:

from sklearn import tree
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

c1=pd.Series([0,1,2,2,2,0,1,2,0,1,2])
c2=pd.Series([0,1,1,2,0,1,0,0,2,1,1])
c3=pd.Series([0,1,1,2,0,1,1,2,0,2,2])
c4=pd.Series([0,1,2,0,0,2,2,1,2,0,1])# My encoding : Veg:0, Glut:1, None:2
labels=pd.Series([0,0,0,0,1,1,1,0,0,1,1])

dnl=pd.concat([c1,c2,c3,c4],axis=1)
d=dnl.to_numpy()

clf = tree.DecisionTreeClassifier(criterion="entropy",random_state=420,max_depth=2,splitter='best')
clf_tree = clf.fit(d, labels.to_numpy())
print(clf_tree)

score=clf_tree.score(d,labels.to_numpy())
error=1-score
print("The error= ",error)

from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(6, 6)) #figsize value changes the size of plot
plot_tree(clf_tree,ax=ax)
plt.show()


from sklearn.metrics import confusion_matrix
yp=clf_tree.predict(dnl)
print(yp)
print(labels.to_numpy())
cm = confusion_matrix(labels, yp)
print("The confusion matrix= ",cm)

Results:

Changing c4 encoding(interchanging 1s and 0s) to below changes the tree! With a lesser misclassification error!
c4=pd.Series([1,0,2,1,1,2,2,0,2,1,0])# Modified encoding: Veg:1, Glut:0,None:2

Why is the decision tree unable to pick mid values as conditions?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

深者入戏 2025-01-21 04:06:53

我假设数字 0、1、2 代表不同的类别。那么你应该在构建树之前使用 one-hot 编码。结果将独立于类别的标签，例如“2”将被视为与“1”类似。在您的设置中，“2”将大于“1”，大于“0”，这意味着类别有顺序。

编辑：

from sklearn import tree
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc= OneHotEncoder(sparse=False)

c1=pd.Series(['0','1','2','2','2','0','1','2','0','1','2'])
c2=pd.Series(['0','1','1','2','0','1','0','0','2','1','1'])
c3=pd.Series(['0','1','1','2','0','1','1','2','0','2','2'])
c4=pd.Series(['0','1','2','0','0','2','2','1','2','0','1'])# My encoding : Veg:0, Glut:1, None:2
labels=pd.Series(['0','0','0','0','1','1','1','0','0','1','1'])

dnl=pd.concat([c1,c2,c3,c4],axis=1)
dnl=dnl.to_numpy()

enc.fit(dnl)

dnl=enc.transform(dnl)
clf = tree.DecisionTreeClassifier(criterion="entropy",random_state=420,max_depth=2,splitter='best')
clf_tree = clf.fit(dnl, labels.to_numpy()) #edited d to dnl 
print(clf_tree)

score=clf_tree.score(dnl,labels.to_numpy())
error=1-score
print("The error= ",error)

from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(6, 6)) #figsize value changes the size of plot
plot_tree(clf_tree,ax=ax)
plt.show()


from sklearn.metrics import confusion_matrix
yp=clf_tree.predict(dnl)
print(yp)
print(labels.to_numpy())
cm = confusion_matrix(labels, yp)
print("The confusion matrix= \n",cm)

I assume the numbers 0,1,2 stand for different categories. Then you should use one-hot encoding before you build the tree. The results will then independent of the label of the category, e.g. '2' will be treated similar as '1'. In your setup '2' will be larger than '1' larger than '0', meaning the categories have an order.

edit:

from sklearn import tree
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc= OneHotEncoder(sparse=False)

c1=pd.Series(['0','1','2','2','2','0','1','2','0','1','2'])
c2=pd.Series(['0','1','1','2','0','1','0','0','2','1','1'])
c3=pd.Series(['0','1','1','2','0','1','1','2','0','2','2'])
c4=pd.Series(['0','1','2','0','0','2','2','1','2','0','1'])# My encoding : Veg:0, Glut:1, None:2
labels=pd.Series(['0','0','0','0','1','1','1','0','0','1','1'])

dnl=pd.concat([c1,c2,c3,c4],axis=1)
dnl=dnl.to_numpy()

enc.fit(dnl)

dnl=enc.transform(dnl)
clf = tree.DecisionTreeClassifier(criterion="entropy",random_state=420,max_depth=2,splitter='best')
clf_tree = clf.fit(dnl, labels.to_numpy()) #edited d to dnl 
print(clf_tree)

score=clf_tree.score(dnl,labels.to_numpy())
error=1-score
print("The error= ",error)

from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(6, 6)) #figsize value changes the size of plot
plot_tree(clf_tree,ax=ax)
plt.show()


from sklearn.metrics import confusion_matrix
yp=clf_tree.predict(dnl)
print(yp)
print(labels.to_numpy())
cm = confusion_matrix(labels, yp)
print("The confusion matrix= \n",cm)

回复收藏 0 原文

~没有更多了~