How to output SHAP values in probability and make a force_plot from a binary classifier

Posted on 2025-01-13 22:40:42

I need to plot how each feature affects the predicted probability for each sample in my LightGBM binary classifier, so I need SHAP values expressed as probabilities rather than the default raw SHAP values. There does not appear to be an option for probability output.

The example code below is what I use to build a DataFrame of SHAP values and draw a force_plot for the first data sample. Does anyone know how I should modify it to change the output?
I'm new to SHAP values and the shap package. Thanks a lot in advance.

import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y,  test_size=0.2)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)


explainer = shap.TreeExplainer(model)
shap_values = explainer(X_train)

# force plot of first row for class 1
class_idx = 1
row_idx = 0
expected_value = explainer.expected_value[class_idx]
shap_value = shap_values[:,:,class_idx].values[row_idx]

shap.force_plot(base_value=expected_value, shap_values=shap_value, features=X_train.iloc[row_idx, :], matplotlib=True)

# dataframe of shap values for class 1
shap_df = pd.DataFrame(shap_values[:, :, 1].values, columns=shap_values.feature_names)


错々过的事 2025-01-20 22:40:42

TL;DR:

You can plot the results in probability space by passing link="logit" to force_plot:

import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from scipy.special import expit

shap.initjs()

data = load_breast_cancer()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)

explainer_raw = shap.TreeExplainer(model)
shap_values = explainer_raw(X_train)

# force plot of first row for class 1
class_idx = 1
row_idx = 0
expected_value = explainer_raw.expected_value[class_idx]
shap_value = shap_values[:, :, class_idx].values[row_idx]

shap.force_plot(
    base_value=expected_value,
    shap_values=shap_value,
    features=X_train.iloc[row_idx, :],
    link="logit",
)

Expected output:

Alternatively, you can get the same result by explicitly asking the explainer to explain the probability output with model_output="probability":

explainer = shap.TreeExplainer(
    model,
    data=X_train,
    feature_perturbation="interventional",
    model_output="probability",
)
shap_values = explainer(X_train)

# force plot of first row for class 1
class_idx = 1
row_idx = 0

shap_value = shap_values.values[row_idx]

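shap.force_plot(
    # base value from this explainer (probability space), not the raw
    # log-odds expected_value computed from explainer_raw above
    base_value=shap_values.base_values[row_idx],
    shap_values=shap_value,
    features=X_train.iloc[row_idx, :],
)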

Expected output (image: https://i.sstatic.net/YJRYv.png)
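
With model_output="probability", the base value plus the per-feature SHAP values of a row should add up to the predicted class 1 probability for that row. The following is a minimal sketch of such a check; max_additivity_gap is a hypothetical helper written for illustration, and the numbers are made up rather than taken from the model above.

```python
def max_additivity_gap(base_values, shap_rows, class1_proba):
    """Largest |base + sum(shap) - proba| over all rows."""
    gaps = [
        abs(base + sum(row) - proba)
        for base, row, proba in zip(base_values, shap_rows, class1_proba)
    ]
    return max(gaps)


# Hypothetical values for two rows and three features (illustration only).
base_values = [0.62, 0.62]
shap_rows = [[0.20, 0.10, 0.05], [-0.30, -0.20, -0.02]]
class1_proba = [0.97, 0.10]

gap = max_additivity_gap(base_values, shap_rows, class1_proba)
print(gap < 1e-9)
```

With the real objects, the arguments would presumably be shap_values.base_values, shap_values.values and model.predict_proba(X_train)[:, 1], assuming the explainer returns one output per row as in the snippet above.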

However, to understand what is happening here, it is more instructive to work out where these figures come from:

Our target probability for the point of interest:

model_proba = model.predict_proba(X_train.iloc[[row_idx]])
model_proba
# array([[0.00275887, 0.99724113]])

Raw base value from the model, with X_train as background (note that LightGBM outputs the raw score for class 1):

model.predict(X_train, raw_score=True).mean()
# 2.4839751932445577

Raw base values from SHAP (note that they are symmetric):

bv = explainer_raw(X_train).base_values[0]
bv
# array([-2.48397519,  2.48397519])

Raw SHAP values for the point of interest, summed over features:

sv_0 = explainer_raw(X_train).values[row_idx].sum(0)
sv_0
# array([-3.40619584,  3.40619584])

Probability inferred from the SHAP values (via the sigmoid):

shap_proba = expit(bv + sv_0)
shap_proba
# array([0.00275887, 0.99724113])

Check:

assert np.allclose(model_proba, shap_proba)
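
If a per-feature table in probability units is still wanted from the raw values, one rough approximation is to spread the total probability change over the features in proportion to their raw SHAP values. This is a heuristic assumed here for illustration, not something the shap package does; rescale_to_probability is a hypothetical helper, and the three per-feature values are made up so that they sum to the 3.40619584 shown above.

```python
import math


def sigmoid(z):
    """Logistic function: maps a raw log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def rescale_to_probability(base_raw, shap_raw):
    """Distribute p(x) - p(base) over features in proportion to raw SHAP values.

    Heuristic only: the result sums to the probability change, but the
    individual numbers are not exact Shapley values of the probability.
    """
    total = sum(shap_raw)
    p_base = sigmoid(base_raw)
    p_x = sigmoid(base_raw + total)
    if total == 0:
        return p_base, [0.0 for _ in shap_raw]
    scale = (p_x - p_base) / total
    return p_base, [value * scale for value in shap_raw]


# Base value from the output above; per-feature values are hypothetical.
p_base, shap_prob = rescale_to_probability(2.48397519, [2.0, 1.5, -0.09380416])
print(p_base + sum(shap_prob))
```

The rescaled values keep the sign and relative size of the raw values, which is why they should be read as a display aid rather than as an attribution in their own right.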

Please ask questions if something is not clear.

Side notes

Probabilities can be misleading if you are analyzing the raw effect size of different features, because the sigmoid is non-linear and saturates beyond a certain threshold.

Some people expect to see SHAP values in probability space as well, but this is not feasible because:
SHAP values are additive by construction (to be precise, SHapley Additive exPlanations are average marginal contributions over all possible feature coalitions)
exp(a + b) != exp(a) + exp(b), so a non-linear transform such as the sigmoid does not preserve additivity
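
A small numeric illustration of the point above, using made-up raw values: in log-odds space a feature's contribution is the same whichever order features are added in, but the probability change credited to it depends on what has already been added.

```python
import math


def sigmoid(z):
    """Logistic function: maps a raw log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


base = 0.0        # hypothetical raw base value (probability 0.5)
a, b = 2.0, 2.0   # hypothetical raw SHAP values of two features

# Probability change credited to feature "a" when it is added first vs. last.
gain_a_first = sigmoid(base + a) - sigmoid(base)
gain_a_last = sigmoid(base + a + b) - sigmoid(base + b)

print(round(gain_a_first, 3), round(gain_a_last, 3))
```

The raw contribution of "a" is 2.0 in both cases, while its probability gain shrinks once "b" has already pushed the prediction towards saturation.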

You may find useful:

Feature importance in a binary classification and extracting SHAP values for one of the classes only (answer)
How to interpret base_value of GBT classifier when using SHAP? (answer)


束缚ｍ 2025-01-20 22:40:42

You can consider running your output values through a softmax() function. For reference, it can be defined as:

import numpy as np

def get_softmax_probabilities(x):
    # normalize along the last axis so that each row sums to 1
    return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)

and there is a scipy implementation as well:

from scipy.special import softmax

The output of softmax() will be probabilities proportional to the exponentials of the values in vector x, which are your SHAP values.
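
One caveat for the binary case, sketched here with the numbers printed in the first answer: SHAP reports the raw scores of the two classes as a symmetric pair [-z, z], and a softmax over that pair equals sigmoid(2z) rather than sigmoid(z). A softmax over [0, z] is what reproduces the sigmoid. The softmax and sigmoid helpers below are plain-Python stand-ins written for this illustration.

```python
import math


def softmax(values):
    """Softmax of a list of floats, shifted by the max for numerical stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Logistic function: maps a raw log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Raw class 1 score of the example row: base value + summed SHAP values.
z = 2.48397519 + 3.40619584

p_sigmoid = sigmoid(z)                   # agrees with predict_proba for class 1
p_softmax_pair = softmax([-z, z])[1]     # softmax over the symmetric pair
p_softmax_zero = softmax([0.0, z])[1]    # softmax over [0, z]

print(p_sigmoid, p_softmax_pair, p_softmax_zero)
```

Under these assumptions, applying softmax directly to the two-class SHAP sums would overstate the class 1 probability.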


笑叹一世浮沉 2025-01-20 22:40:42

import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y,  test_size=0.2)
print('X_train: ',X_train.shape)
print('X_test: ',X_test.shape)

model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

# plot
# shap.summary_plot(shap_values[class_idx], X_train, plot_type='bar')
# shap.summary_plot(shap_values[class_idx], X_train)

# shap_value = shap_values[:,:,class_idx].values[row_idx]
# shap.force_plot (base_value = expected_value,  shap_values = shap_value, features = X_train.iloc[row_idx, :], matplotlib=True)
# # dataframe of shap values for class 1
# shap_df = pd.DataFrame(shap_values[:,:, 1 ].values, columns = shap_values.feature_names)

# verification
def verification(index_number, class_idx):
    print('-----------------------------------')
    print('index_number: ', index_number)
    print('class_idx: ', class_idx)
    print('')

    y_base = explainer.expected_value[class_idx]
    print('y_base: ', y_base)

    # use the index_number argument rather than the global j
    player_explainer = pd.DataFrame()
    player_explainer['feature_value'] = X_train.iloc[index_number].values
    player_explainer['shap_value'] = shap_values[class_idx][index_number]
    print('verification: ')
    print('y_base + sum_of_shap_values: %.2f'%(y_base + player_explainer['shap_value'].sum()))
    # note: y_train holds the true label, so this prints the label, not a model prediction
    print('y_pred: %.2f'%(y_train[index_number]))

j = 10  # index
verification(j,0)
verification(j,1)

# show: 
# X_train:  (455, 30)
# X_test:  (114, 30)
# -----------------------------------
# index_number:  10
# class_idx:  0

# y_base:  -2.391423081639827
# verification: 
# y_base + sum_of_shap_values: -9.40
# y_pred: 1.00
# -----------------------------------
# index_number:  10
# class_idx:  1

# y_base:  2.391423081639827
# verification: 
# y_base + sum_of_shap_values: 9.40
# y_pred: 1.00
# Of -9.40 and 9.40, the larger value belongs to class_idx 1, which matches y_pred, so the result is consistent.

I implemented this for you and verified the reliability of the results.
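
The sums printed by verification() are raw scores (log-odds), not probabilities. As a minimal sketch, assuming the model uses the usual logistic link, they can be mapped to probabilities with a sigmoid; the two input numbers are copied from the output shown above.

```python
import math


def sigmoid(z):
    """Logistic function: maps a raw log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


raw_class1 = 9.40    # y_base + sum_of_shap_values for class_idx 1
raw_class0 = -9.40   # the same quantity for class_idx 0

p1 = sigmoid(raw_class1)
p0 = sigmoid(raw_class0)

print(f"P(class 1) = {p1:.5f}, P(class 0) = {p0:.5f}")
```

The two probabilities are complementary, which is the probability-space counterpart of the symmetric raw values.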
