Get waterfall plot values of a feature in a dataframe using the SHAP package


I am working on a binary classification using random forest and neural network models, in which I am using SHAP to explain the model predictions. I followed the tutorial and wrote the code below to get the waterfall plot.

With the help of Sergey Bushmanov's SO post here, I managed to export the waterfall plot to a dataframe. But this doesn't copy the feature values of the columns; it only copies the shap values, expected_value, and feature names. I want the feature values as well. So, I tried the below

shap.waterfall_plot(shap.Explanation(values=shap_values[1])[4],
                    base_values=explainer.expected_value[1],
                    data=ord_test_t.iloc[4],
                    feature_names=ord_test_t.columns.tolist())

but this threw an error

TypeError: waterfall() got an unexpected keyword argument
'base_values'

I expect my output to be like the below. I have used a background of 1 point to compute the base value, but you are free to use a background of 1, 10, or 100 points as well. In the output below, I have stored the value and feature name together in one column called Feature. This is similar to what LIME does, but I am not sure whether SHAP has the flexibility to do this.

(screenshot of the expected output format)

Update - plot

(screenshot of the waterfall plot)

Update code - KernelExplainer waterfall to dataframe

import pandas as pd
from shap import KernelExplainer
from shap.maskers import Independent

masker = Independent(X_train, max_samples=100)  # background capped at 100 samples
explainer = KernelExplainer(rf_boruta.predict, X_train)
bv = explainer.expected_value
sv = explainer.shap_values(X_train)  # 2-D ndarray (n_samples, n_features) for a single-output predict

sdf_train = pd.DataFrame({
    'row_id': X_train.index.values.repeat(X_train.shape[1]),
    'feature': X_train.columns.to_list() * X_train.shape[0],
    'feature_value': X_train.values.flatten(),
    'base_value': bv,
    'shap_values': sv.flatten()   # sv is a plain ndarray here, so no .values indexing
})
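
For example (an illustrative addition), the explanation for a single observation can be pulled out of this long-format frame:

# illustrative: inspect all feature rows for one observation
row = X_train.index[0]
print(sdf_train[sdf_train['row_id'] == row])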


1 Answer


Try the following:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer, Explanation
from shap.plots import waterfall

import pandas as pd
import shap
print(shap.__version__)

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
sv = explainer(X)                        # Explanation with per-class SHAP values
exp = Explanation(sv.values[:,:,1],      # SHAP values for class 1
                  sv.base_values[:,1],   # base value for class 1
                  data=X.values,         # feature values, so the plot can show them
                  feature_names=X.columns)
idx = 0
waterfall(exp[idx])

0.39.0

(waterfall plot for row idx = 0)
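
As a quick sanity check (an illustrative aside, not part of the original answer), the SHAP values plus the base value should reconstruct the model's predicted probability for class 1:

import numpy as np

reconstructed = exp[idx].base_values + exp[idx].values.sum()
predicted = model.predict_proba(X.iloc[[idx]])[0, 1]
print(np.isclose(reconstructed, predicted))  # local accuracy: expect True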

Then:

pd.DataFrame({
    'row_id': idx,
    'feature': X.columns,
    'feature_value': exp[idx].data,      # actual feature values for this row
    'base_value': exp[idx].base_values,
    'shap_values': exp[idx].values
})

#expected output
row_id  feature feature_value   base_value  shap_values
0   0   mean radius 17.990000   0.628998    -0.035453
1   0   mean texture    10.380000   0.628998    0.047571
2   0   mean perimeter  122.800000  0.628998    -0.036218
3   0   mean area   1001.000000 0.628998    -0.041276
4   0   mean smoothness 0.118400    0.628998    -0.006842
5   0   mean compactness    0.277600    0.628998    -0.009275
6   0   mean concavity  0.300100    0.628998    -0.035188
7   0   mean concave points 0.147100    0.628998    -0.051165
8   0   mean symmetry   0.241900    0.628998    -0.002192
9   0   mean fractal dimension  0.078710    0.628998    0.001521
10  0   radius error    1.095000    0.628998    -0.021223
11  0   texture error   0.905300    0.628998    -0.000470
12  0   perimeter error 8.589000    0.628998    -0.021423
13  0   area error  153.400000  0.628998    -0.035313
14  0   smoothness error    0.006399    0.628998    -0.000060
15  0   compactness error   0.049040    0.628998    0.001053
16  0   concavity error 0.053730    0.628998    -0.002988
17  0   concave points error    0.015870    0.628998    0.000140
18  0   symmetry error  0.030030    0.628998    0.001238
19  0   fractal dimension error 0.006193    0.628998    -0.001097
20  0   worst radius    25.380000   0.628998    -0.050027
21  0   worst texture   17.330000   0.628998    0.038056
22  0   worst perimeter 184.600000  0.628998    -0.079717
23  0   worst area  2019.000000 0.628998    -0.072312
24  0   worst smoothness    0.162200    0.628998    -0.006917
25  0   worst compactness   0.665600    0.628998    -0.016184
26  0   worst concavity 0.711900    0.628998    -0.022500
27  0   worst concave points    0.265400    0.628998    -0.088697
28  0   worst symmetry  0.460100    0.628998    -0.026166
29  0   worst fractal dimension 0.118900    0.628998    -0.007683
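
As a quick follow-up (an illustrative sketch, assuming the frame built above is assigned to df), you can rank features by the magnitude of their contribution:

top = (df.assign(abs_shap=df['shap_values'].abs())
         .sort_values('abs_shap', ascending=False)
         .head(5))
print(top[['feature', 'feature_value', 'shap_values']])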

Why the hand-built Explanation? RandomForest is a bit special: its SHAP output carries an extra trailing dimension, one slice per class, which is why we select class 1 with [:,:,1]. When something fails with the new plots API, try feeding an Explanation object constructed by hand.
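
You can see this by inspecting the shapes of the raw TreeExplainer output (using sv from the code above):

print(sv.values.shape)       # (569, 30, 2): samples x features x classes
print(sv.base_values.shape)  # (569, 2): one base value per class per sample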

UPDATE

To explain a single datapoint exp_id vs a single background datapoint back_id (i.e., to answer the question "why does the prediction for exp_id differ from the prediction for back_id"):

back_id = 10
exp_id = 20
explainer = TreeExplainer(model, data=X.loc[[back_id]])  # single-row background
sv = explainer(X.loc[[exp_id]])
exp = Explanation(sv.values[:,:,1],
                  sv.base_values[:,1],
                  data=X.loc[[exp_id]].values,  # feature values of the explained row
                  feature_names=X.columns)
waterfall(exp[0])

(waterfall plot: row 20 explained against background row 10)
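
To see what the base value means here (an illustrative check, not from the original answer): with a single background row, the base value is simply the model's prediction for that row:

print(explainer.expected_value[1])                  # base value for class 1
print(model.predict_proba(X.loc[[back_id]])[0, 1])  # should match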

Finally, as you asked for everything in the suggested format:

from shap.maskers import Independent
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)

masker = Independent(X_train, max_samples=100)
explainer = TreeExplainer(model, data=masker)
bv = explainer.expected_value[1]
sv = explainer(X_test, check_additivity=False)

pd.DataFrame({
    'row_id': X_test.index.values.repeat(X_test.shape[1]),
    'feature': X_test.columns.to_list() * X_test.shape[0],
    'feature_value': X_test.values.flatten(),
    'base_value': bv,
    'shap_values': sv.values[:,:,1].flatten()
})

But I'd definitely not show this to my mom.
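
If you want the single LIME-style Feature column the question asks about, one way (a sketch added for illustration, assuming the frame above is assigned to df) is to concatenate the name and the value:

df['Feature'] = df['feature'] + ' = ' + df['feature_value'].round(3).astype(str)
print(df[['row_id', 'Feature', 'base_value', 'shap_values']].head())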
