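Decision tree in sklearn: ordinal data is still a serious problem
I want to show a small example and ask for a solution. There are many decision-tree questions here, including ones about treating features as ordinal versus categorical. My example is the code below: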
from sklearn import tree
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
c1=pd.Series([0,1,2,2,2,0,1,2,0,1,2])
c2=pd.Series([0,1,1,2,0,1,0,0,2,1,1])
c3=pd.Series([0,1,1,2,0,1,1,2,0,2,2])
c4=pd.Series([0,1,2,0,0,2,2,1,2,0,1])# My encoding : Veg:0, Glut:1, None:2
labels=pd.Series([0,0,0,0,1,1,1,0,0,1,1])
dnl=pd.concat([c1,c2,c3,c4],axis=1)
d=dnl.to_numpy()
clf = tree.DecisionTreeClassifier(criterion="entropy",random_state=420,max_depth=2,splitter='best')
clf_tree = clf.fit(d, labels.to_numpy())
print(clf_tree)
score=clf_tree.score(d,labels.to_numpy())
error=1-score
print("The error= ",error)
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(6, 6)) #figsize value changes the size of plot
plot_tree(clf_tree,ax=ax)
plt.show()
from sklearn.metrics import confusion_matrix
yp=clf_tree.predict(dnl)
print(yp)
print(labels.to_numpy())
cm = confusion_matrix(labels, yp)
print("The confusion matrix= ",cm)
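Results:
Changing the c4 encoding (swapping the 1s and 0s) to the one below changes the tree, and the misclassification error is lower:
c4=pd.Series([1,0,2,1,1,2,2,0,2,1,0])# Modified encoding: Veg:1, Glut:0, None:2
Why is the decision tree unable to pick the middle value as a split condition?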
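Comments (1)
I assume the numbers 0, 1, 2 stand for different categories. If so, you should one-hot encode them before you build the tree. The results will then be independent of the labels assigned to the categories, e.g. '2' will be treated the same way as '1'. In your setup, 2 is larger than 1, which is larger than 0, so the tree treats the categories as ordered. Because each split is a threshold test of the form x <= t, a single split can separate 0 from {1, 2} or {0, 1} from 2, but never 1 from {0, 2}, which is why swapping the codes changes the tree.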
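A minimal sketch of the one-hot approach, using the data from the question. It assumes that only c4 is categorical (it is the only column with a documented encoding) and uses `pd.get_dummies` for the encoding; the names `names_a`, `names_b`, `onehot_a`, `onehot_b` and the helper `fit_tree` are made up for this illustration. The point it shows is that both integer encodings of c4 produce the same one-hot columns, so the fitted tree no longer depends on which code was given to which category.

```python
import pandas as pd
from sklearn import tree

# Data copied from the question.
c1 = pd.Series([0, 1, 2, 2, 2, 0, 1, 2, 0, 1, 2], name="c1")
c2 = pd.Series([0, 1, 1, 2, 0, 1, 0, 0, 2, 1, 1], name="c2")
c3 = pd.Series([0, 1, 1, 2, 0, 1, 1, 2, 0, 2, 2], name="c3")
labels = pd.Series([0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1])

# The two integer encodings of c4 from the question.
c4_a = pd.Series([0, 1, 2, 0, 0, 2, 2, 1, 2, 0, 1])  # Veg:0, Glut:1, None:2
c4_b = pd.Series([1, 0, 2, 1, 1, 2, 2, 0, 2, 1, 0])  # Veg:1, Glut:0, None:2

# Map the codes back to category names (assumption: c4 is nominal).
names_a = {0: "Veg", 1: "Glut", 2: "None"}
names_b = {1: "Veg", 0: "Glut", 2: "None"}

# One indicator column per category; the integer codes no longer matter.
onehot_a = pd.get_dummies(c4_a.map(names_a), prefix="c4", dtype=int)
onehot_b = pd.get_dummies(c4_b.map(names_b), prefix="c4", dtype=int)
print(onehot_a.equals(onehot_b))  # both encodings give the same columns


def fit_tree(onehot):
    """Fit the question's tree on c1..c3 plus the one-hot encoded c4."""
    X = pd.concat([c1, c2, c3, onehot], axis=1)
    clf = tree.DecisionTreeClassifier(
        criterion="entropy", random_state=420, max_depth=2, splitter="best"
    )
    clf.fit(X, labels)
    return X, clf


X_a, clf_a = fit_tree(onehot_a)
X_b, clf_b = fit_tree(onehot_b)

pred_a = clf_a.predict(X_a)
pred_b = clf_b.predict(X_b)
print("The error= ", 1 - clf_a.score(X_a, labels))
print("Same predictions for both encodings:", (pred_a == pred_b).all())
```

A split on an indicator column such as `c4_Glut <= 0.5` isolates one category from the other two, which is the "middle value" split that the integer encoding cannot express in a single test. If c1, c2 and c3 are also nominal, they could be encoded the same way.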