How to get SHAP values for each class on a multiclass classification problem in Python
I have the following dataframe:
import pandas as pd
import random
import xgboost
import shap
from sklearn.model_selection import train_test_split
foo = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10],
'var1':random.sample(range(1, 100), 10),
'var2':random.sample(range(1, 100), 10),
'var3':random.sample(range(1, 100), 10),
'class': ['a','a','a','a','a','b','b','c','c','c']})
I want to run a classification algorithm to predict the 3 classes.
So I split my dataset into a training set and a test set and ran an XGBoost classification:
cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
foo[['class']],
test_size=0.33, random_state=42)
model = xgboost.XGBClassifier(objective="binary:logistic")
model.fit(X_train, y_train)
Now I would like to get the mean SHAP values for each class, instead of the mean of the absolute SHAP values generated by this code:
shap_values = shap.TreeExplainer(model).shap_values(X_test)
shap.summary_plot(shap_values, X_test)
Also, the plot labels the classes as 0, 1, 2. How can I know which of the original classes 0, 1 and 2 correspond to?
Because this code:
shap.summary_plot(shap_values, X_test,
class_names= ['a', 'b', 'c'])
gives
and this code:
shap.summary_plot(shap_values, X_test,
class_names= ['b', 'c', 'a'])
gives
So I am not sure about the legend anymore.
Any ideas?
SHAP values are returned as a list. You can access the SHAP values for a given class via its index.
For the summary plot of your class 0, the code would be
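As a rough sketch of the indexing idea, and of the signed per-class mean the question asks for: the arrays below are synthetic stand-ins for the list that `shap_values` holds in this thread (one array per class), and `mean_shap_per_class` is a hypothetical helper, not part of shap. Newer shap releases may return a single 3-D array instead of a list, in which case the indexing would differ.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for shap.TreeExplainer(model).shap_values(X_test) in list
# form: one (n_samples, n_features) array per class. Not real model output.
rng = np.random.default_rng(0)
feature_names = ["var1", "var2", "var3"]
class_names = ["a", "b", "c"]  # assumed to follow the order of model.classes_
shap_values = [rng.normal(size=(4, 3)) for _ in class_names]

# Class 0 is simply the first element of the list. With real values it could
# be plotted on its own with: shap.summary_plot(shap_values[0], X_test)
class0_values = shap_values[0]

def mean_shap_per_class(shap_values, feature_names, class_names):
    """Hypothetical helper: features x classes table of signed mean SHAP values."""
    return pd.DataFrame(
        {name: np.asarray(vals).mean(axis=0)
         for name, vals in zip(class_names, shap_values)},
        index=feature_names,
    )

mean_shap = mean_shap_per_class(shap_values, feature_names, class_names)
print(mean_shap)
```

Taking the plain mean (rather than the mean of absolute values) keeps the sign, so positive and negative contributions can cancel out within a class.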
I had the same question; perhaps this issue can help: https://github.com/slundberg/shap/issues/764
I haven't tested it yet, but it seems the order should be the same as the order you get when calling model.predict_proba(). The linked issue suggests using the class_names=model.classes_ option of the summary plot.
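A small check of the claim that the class order follows `model.predict_proba()` might look like this. It is only a sketch: it uses scikit-learn's `RandomForestClassifier` on synthetic data rather than the XGBoost model from the question, since string targets may not be accepted by every XGBoost version.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data shaped like the question's frame: 3 features, classes a/b/c.
rng = np.random.default_rng(42)
X = rng.integers(1, 100, size=(10, 3))
y = np.array(["a"] * 5 + ["b"] * 2 + ["c"] * 3)

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Column i of predict_proba corresponds to model.classes_[i].
proba = model.predict_proba(X)
print(model.classes_)

# Mapping the arg-max column back through classes_ reproduces predict().
predicted = model.classes_[proba.argmax(axis=1)]
print((predicted == model.predict(X)).all())
```

If the per-class SHAP arrays follow the same column order, `model.classes_` is the natural value for `class_names`.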
By doing some research and with the help of this post and @Alessandro Nesti's answer, here is my solution:
which results in
This is an updated version of @quant's code:
The custom solution is an over-complication, IMHO.
Solution
Explanation
Pass the class names to summary_plot. This has to reflect the order of predictions. Since one doesn't know the order a priori, one can typically use model.classes_ for that purpose;
Tell shap to stick to the original order of predictions instead of sorting them: class_inds="original" (see the relevant code here).
P.S. I use shap 0.40.0
P.P.S. I was not able to run your example, as my version of XGBoost doesn't allow strings as target categories. But it works with a label-encoded target or with other model types (sklearn.RandomForestClassifier or lgb.LGBMClassifier).
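To illustrate why the legend can look shuffled, here is a numpy-only sketch of the two orderings. It assumes, based on my reading of the shap source rather than anything stated in this thread, that the default ordering sorts classes by their mean absolute SHAP value; the arrays are synthetic and the variable names are illustrative.

```python
import numpy as np

# Synthetic per-class SHAP arrays with deliberately different magnitudes.
shap_values = [s * np.ones((4, 3)) for s in (0.5, -2.0, 1.0)]
class_names = np.array(["a", "b", "c"])  # assumed to equal model.classes_

# Assumed default behaviour: classes ordered by descending mean |SHAP|.
importance = np.array([np.abs(v).mean() for v in shap_values])
sorted_inds = np.argsort(-importance)

# class_inds="original" keeps the order in which the classes were predicted.
original_inds = np.arange(len(shap_values))

print(class_names[sorted_inds])    # order after sorting by magnitude
print(class_names[original_inds])  # order of the predictions
```

Either way, names passed as `class_names` are looked up by class index, so they only line up with the plot if they are given in prediction order.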
First, you need to encode the target with LabelEncoder, and then use its classes_ attribute.
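A minimal sketch of that suggestion, using the class column from the question: after fitting, `classes_` lists the original labels in the order of their integer codes, which answers the question of what 0, 1 and 2 stand for.

```python
from sklearn.preprocessing import LabelEncoder

# The target column from the question's data frame.
labels = ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c']

le = LabelEncoder()
y_encoded = le.fit_transform(labels)  # integer codes a model can train on

# classes_[i] is the original label behind integer code i.
print(le.classes_)

# inverse_transform maps codes back to the original labels.
decoded = le.inverse_transform([0, 1, 2])
print(decoded)
```

The encoded target can then be passed to `model.fit`, and `le.classes_` can be supplied as `class_names` when plotting.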