How to get SHAP values for each class on a multiclass classification problem in Python

Posted on 2025-01-19 03:28:40


I have the following dataframe:

import pandas as pd
import random

import xgboost
import shap
from sklearn.model_selection import train_test_split

foo = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10],
                   'var1':random.sample(range(1, 100), 10),
                   'var2':random.sample(range(1, 100), 10),
                   'var3':random.sample(range(1, 100), 10),
                   'class': ['a','a','a','a','a','b','b','c','c','c']})

I want to run a classification algorithm to predict the 3 classes.

So I split my dataset into training and test sets and ran an XGBoost classifier:

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                        foo[['class']],
                                                        test_size=0.33, random_state=42)


model = xgboost.XGBClassifier(objective="binary:logistic")
model.fit(X_train, y_train)

Now I would like to get the mean SHAP values for each class, instead of the mean of the absolute SHAP values generated by this code:

shap_values = shap.TreeExplainer(model).shap_values(X_test)
shap.summary_plot(shap_values, X_test)

Also, the plot labels the classes as 0, 1, 2. How can I know which of the original classes 0, 1 and 2 correspond to?

Because this code:

shap.summary_plot(shap_values, X_test,
                 class_names= ['a', 'b', 'c'])

gives

and this code:

shap.summary_plot(shap_values, X_test,
                 class_names= ['b', 'c', 'a'])

gives

So I am not sure about the legend anymore.
Any ideas?


樱娆 2025-01-26 03:28:40

SHAP values are returned as a list. You can access the SHAP values of each class via its index in that list.

For the summary plot of your Class 0, the code would be

shap.summary_plot(shap_values[0], X_test)
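
To get the signed mean per class (rather than the mean of absolute values), one option is to average each class's SHAP matrix over the samples. The sketch below uses made-up numbers in place of real explainer output, and the helper name `per_class_mean_shap` is hypothetical. It assumes `shap_values` is either a list of (n_samples, n_features) arrays, as described above, or a single (n_samples, n_features, n_classes) array, a layout that some newer shap releases appear to return instead.

```python
import numpy as np
import pandas as pd


def per_class_mean_shap(shap_values, feature_names, class_names):
    """Return a (features x classes) table of signed mean SHAP values."""
    if isinstance(shap_values, list):
        # List layout: one (n_samples, n_features) array per class.
        stacked = np.stack(shap_values, axis=-1)
    else:
        # Array layout: assumed to be (n_samples, n_features, n_classes).
        stacked = np.asarray(shap_values)
    # Average over samples (axis 0) without taking absolute values first.
    means = stacked.mean(axis=0)
    return pd.DataFrame(means, index=list(feature_names), columns=list(class_names))


# Stand-in for explainer output: 2 samples, 2 features, 3 classes.
fake_shap = [
    np.array([[0.2, -0.1], [0.4, -0.3]]),   # class index 0
    np.array([[-0.2, 0.1], [0.0, 0.1]]),    # class index 1
    np.array([[0.0, 0.0], [-0.4, 0.2]]),    # class index 2
]
table = per_class_mean_shap(fake_shap, ["var1", "var2"], ["a", "b", "c"])
print(table)
```

With real output, `fake_shap` would be replaced by `shap_values`, and the class names by the model's own class ordering.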


一片旧的回忆 2025-01-26 03:28:40

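I have the same question; maybe this issue can help: https://github.com/slundberg/shap/issues/764

I haven't tested it, but it looks like the order should be the same as the order you get when calling model.predict_proba(). The link above suggests using the class_names=model.classes_ option of the summary plot.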
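
As a small illustration of that ordering: scikit-learn style estimators typically store their labels in `classes_` in sorted order, and the columns of `predict_proba()` follow it. The sketch below only reproduces the sorted ordering with `np.unique` on the question's labels. Whether this matches a given XGBoost or shap version is an assumption that should be checked against `model.classes_` directly.

```python
import numpy as np

labels = np.array(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'])

# np.unique returns the distinct labels in sorted order, plus the integer
# code of each row; sklearn-style estimators typically derive classes_ this way.
classes, codes = np.unique(labels, return_inverse=True)

# Index i in the SHAP output would then correspond to classes[i].
index_to_label = {i: str(name) for i, name in enumerate(classes)}
print(index_to_label)
```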


游魂 2025-01-26 03:28:40

After some research, and with the help of this post and @Alessandro Nesti's answer, here is my solution:

foo = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10],
                   'var1':random.sample(range(1, 100), 10),
                   'var2':random.sample(range(1, 100), 10),
                   'var3':random.sample(range(1, 100), 10),
                   'class': ['a','a','a','a','a','b','b','c','c','c']})

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                        foo[['class']],
                                                        test_size=0.33, random_state=42)


model = xgboost.XGBClassifier(objective="multi:softmax")
model.fit(X_train, y_train)

import numpy as np

def get_ABS_SHAP(df_shap,df):
    # Make a copy of the input data
    shap_v = pd.DataFrame(df_shap)
    feature_list = df.columns
    shap_v.columns = feature_list
    df_v = df.copy().reset_index().drop('index',axis=1)
    
    # Determine the correlation in order to plot with different colors
    corr_list = list()
    for i in feature_list:
        b = np.corrcoef(shap_v[i],df_v[i])[1][0]
        corr_list.append(b)
    corr_df = pd.concat([pd.Series(feature_list),pd.Series(corr_list)],axis=1).fillna(0)
 
    # Make a data frame. Column 1 is the feature, and Column 2 is the correlation coefficient
    corr_df.columns  = ['Variable','Corr']
    corr_df['Sign'] = np.where(corr_df['Corr']>0,'red','blue')
    
    shap_abs = np.abs(shap_v)
    k=pd.DataFrame(shap_abs.mean()).reset_index()
    k.columns = ['Variable','SHAP_abs']
    k2 = k.merge(corr_df,left_on = 'Variable',right_on='Variable',how='inner')
    k2 = k2.sort_values(by='SHAP_abs',ascending = True)
    
    # Work on a copy so the assignments below do not trigger chained-assignment warnings
    k2_f = k2[['Variable', 'SHAP_abs', 'Corr']].copy()
    k2_f['SHAP_abs'] = k2_f['SHAP_abs'] * np.sign(k2_f['Corr'])
    k2_f = k2_f.drop(columns='Corr').rename(columns={'SHAP_abs': 'SHAP'})
    
    return k2_f

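foo_all = pd.DataFrame()

# shap_values is assumed to be a list with one array per class, in model.classes_ order
for k, v in enumerate(model.classes_):
    # Use a separate name so the original foo dataframe is not overwritten
    shap_k = get_ABS_SHAP(shap_values[k], X_test)
    shap_k['class'] = v
    foo_all = pd.concat([foo_all, shap_k])

import plotly_express as px
px.bar(foo_all,x='SHAP', y='Variable', color='class')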

which results in a bar chart of the signed mean SHAP value per feature, with one color per class.
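
The idea in the function above (mean absolute SHAP value per feature, signed by the correlation between the SHAP values and the feature values) can also be written in vectorized form. This is only a sketch on made-up data: the helper name `signed_importance` is hypothetical, and `shap_matrix` stands in for one class's entry of `shap_values`.

```python
import numpy as np
import pandas as pd


def signed_importance(shap_matrix, X):
    """Mean |SHAP| per feature, signed by corr(SHAP value, feature value)."""
    shap_df = pd.DataFrame(np.asarray(shap_matrix), columns=X.columns)
    features = X.reset_index(drop=True)
    # Column-wise Pearson correlation; NaN (constant column) is treated as 0.
    corr = shap_df.corrwith(features).fillna(0)
    return shap_df.abs().mean() * np.sign(corr)


# Made-up data: var1 pushes the prediction up as it grows, var2 pushes it down.
X = pd.DataFrame({"var1": [1, 2, 3, 4], "var2": [10, 20, 30, 40]})
shap_matrix = np.array([[-0.3, 0.2], [-0.1, 0.1], [0.1, -0.1], [0.3, -0.2]])
result = signed_importance(shap_matrix, X)
print(result)
```

Note that this sign reflects a linear association only, so it can be misleading for features whose effect is not monotonic.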


坦然微笑 2025-01-26 03:28:40

This is an updated version of @quant's code:

import pandas as pd
import random

import numpy as np

import xgboost
import shap

from sklearn.model_selection import train_test_split

import plotly_express as px


foo = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10],
                   'var1':random.sample(range(1, 100), 10),
                   'var2':random.sample(range(1, 100), 10),
                   'var3':random.sample(range(1, 100), 10),
                   'class': ['a','a','a','a','a','b','b','c','c','c']})

foo['class'], _ = pd.factorize(foo['class'], sort = True)

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                        foo[['class']],
                                                        test_size=0.33, random_state=42)

model = xgboost.XGBClassifier(objective="multi:softmax")
model.fit(X_train, y_train)

shap_values = shap.TreeExplainer(model).shap_values(X_test)

def get_ABS_SHAP(df_shap,df):
    #import matplotlib as plt
    # Make a copy of the input data
    shap_v = pd.DataFrame(df_shap)
    feature_list = df.columns
    shap_v.columns = feature_list
    df_v = df.copy().reset_index().drop('index',axis=1)
    
    # Determine the correlation in order to plot with different colors
    corr_list = list()
    for i in feature_list:
        b = np.corrcoef(shap_v[i],df_v[i])[1][0]
        corr_list.append(b)
    corr_df = pd.concat([pd.Series(feature_list),pd.Series(corr_list)],axis=1).fillna(0)
 
    # Make a data frame. Column 1 is the feature, and Column 2 is the correlation coefficient
    corr_df.columns  = ['Variable','Corr']
    corr_df['Sign'] = np.where(corr_df['Corr']>0,'red','blue')
    
    shap_abs = np.abs(shap_v)
    k=pd.DataFrame(shap_abs.mean()).reset_index()
    k.columns = ['Variable','SHAP_abs']
    k2 = k.merge(corr_df,left_on = 'Variable',right_on='Variable',how='inner')
    k2 = k2.sort_values(by='SHAP_abs',ascending = True)
    
    # Work on a copy so the assignments below do not trigger chained-assignment warnings
    k2_f = k2[['Variable', 'SHAP_abs', 'Corr']].copy()
    k2_f['SHAP_abs'] = k2_f['SHAP_abs'] * np.sign(k2_f['Corr'])
    k2_f = k2_f.drop(columns='Corr').rename(columns={'SHAP_abs': 'SHAP'})
    
    return k2_f

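foo_all = pd.DataFrame()

# shap_values is assumed to be a list with one array per class, in model.classes_ order
for k, v in enumerate(model.classes_):
    # Use a separate name so the original foo dataframe is not overwritten
    shap_k = get_ABS_SHAP(shap_values[k], X_test)
    # Store the class as a string so plotly uses discrete colors rather than a color scale
    shap_k['class'] = str(v)
    foo_all = pd.concat([foo_all, shap_k])

px.bar(foo_all,x='SHAP', y='Variable', color='class')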


苍白女子 2025-01-26 03:28:40

The custom solution is an over-complication, IMHO.

Solution

shap.summary_plot(shap_values, X_test, class_inds="original", class_names=model.classes_)

Explanation

Pass the class names to summary_plot. They have to reflect the order of the predictions. Since one doesn't know that order a priori, one can typically use model.classes_ for that purpose.
Instruct shap to stick to the original order of the predictions instead of sorting them: class_inds="original" (see the relevant code in the shap source).

P.S. I use shap 0.40.0

P.P.S. I was not able to run your example, as my version of XGBoost doesn't allow strings as target categories. But it works with a label-encoded target or with other model types (sklearn.RandomForestClassifier or lgb.LGBMClassifier).
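
One way to label-encode the target while keeping the original names is `pd.factorize`. The sketch below is an illustration under that assumption, not part of the original answer. It shows only the encoding step, with the question's labels.

```python
import pandas as pd

y = pd.Series(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'], name='class')

# sort=True makes the integer codes follow the sorted label order.
codes, class_names = pd.factorize(y, sort=True)

# codes is what would be passed to model.fit(); class_names[i] is the
# original label of encoded class i and could be given as class_names=...
decoded = class_names[codes]
print(list(class_names))
```

Keeping `class_names` around avoids having to guess later which integer stands for which label.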


千柳 2025-01-26 03:28:40

First, encode the labels with LabelEncoder, then use its classes_ as the class names:

import pandas as pd
import random

import xgboost
import shap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

foo = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10],
               'var1':random.sample(range(1, 100), 10),
               'var2':random.sample(range(1, 100), 10),
               'var3':random.sample(range(1, 100), 10),
               'class': ['a','a','a','a','a','b','b','c','c','c']})

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                    foo[['class']],
                                                    test_size=0.33, 
                                                    random_state=42)

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train.values.ravel())
y_test_encoded = label_encoder.transform(y_test.values.ravel())

model = xgboost.XGBClassifier(objective="multi:softprob", 
                              num_class=len(label_encoder.classes_))
model.fit(X_train, y_train_encoded)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# classes_[i] is the original label of encoded class i
classes = label_encoder.classes_
shap.summary_plot(shap_values, X_test, class_names=classes)