How to extract the most important features from an ML model using SHAP - why are all my column names None?
I want to find the most important features in my model using shap.
I have this code:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
import shap
import pandas as pd
import numpy as np
#loading and preparing the data
iris = load_breast_cancer()
X = iris.data
y = iris.target
columns = iris.feature_names
#if you don't shuffle you won't need to keep track of test_index, but I think
#it is always good practice to shuffle your data
kf = KFold(n_splits=2,shuffle=True)
list_shap_values = list()
list_test_sets = list()
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    X_train = pd.DataFrame(X_train,columns=columns)
    X_test = pd.DataFrame(X_test,columns=columns)
    #training model
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_train, y_train)
    #explaining model
    explainer = shap.TreeExplainer(clf)
    shap_values = explainer.shap_values(X_test)
    #for each iteration we save the test_set index and the shap_values
    list_shap_values.append(shap_values)
    list_test_sets.append(test_index)
#combining results from all iterations
test_set = list_test_sets[0]
shap_values = np.array(list_shap_values[0])
for i in range(1,len(list_test_sets)):
    test_set = np.concatenate((test_set,list_test_sets[i]),axis=0)
    shap_values = np.concatenate((shap_values,np.array(list_shap_values[i])),axis=1)
#bringing back variable names
X_test = pd.DataFrame(X[test_set],columns=columns)
#creating explanation plot for the whole experiment; the first dimension of shap_values indicates the class we are predicting (0=0, 1=1)
#shap.summary_plot(shap_values[1], X_test)
shap_sum = np.abs(shap_values).mean(axis=0)
#columns = full_X_train.columns
X_test = pd.DataFrame(X[test_set],columns=columns)
importance_df = pd.DataFrame([X_test.columns.tolist(),shap_sum.tolist()]).T
importance_df.columns = ['column_name','shap_importance']
importance_df = importance_df.sort_values('shap_importance',ascending=False)
print(importance_df)
The output is:
390 None [0.07973283098297632, 0.012745693741197047, 0....
477 None [0.07639585953247056, 0.012705549054148915, 0....
542 None [0.07263038600009886, 0.004509187889530952, 0....
359 None [0.07006782821092902, 0.008022265024270826, 0....
292 None [0.06501143916982145, 0.014648801487419996, 0....
.. ... ...
129 None [0.001207252383050206, 0.005154096692481416, 0...
68 None [0.000537261423323933, 0.000554437257101772, 0...
229 None [0.00046312350178067416, 0.0171676941721087, 0...
94 None [0.00016002701188627102, 0.015384623641506117,...
97 None [0.0001434577248065334, 0.01162161896706629, 0...
This isn't correct: the column names are all None, and it's not clear to me what the SHAP values are (I was expecting one number for each column, ranked with the most important feature at the top of what's being printed - not a list).
I was hoping for something more like:
Column Shap value
Age 0.3
Gender 0.2
Could someone show me where I went wrong, and how to list the important features for my model using this method?
1 Answer
If you check the shape of shap_values against the shape of X_test, you'll find a surprising coincidence on the last 2 axes, which did not happen by chance (a sketch of the check follows):
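The answer's original code blocks did not survive here; a minimal sketch of the shape check being described, using the combined shap_values and X_test from the question's code (the breast cancer data has 569 samples and 30 features):
#shap_values stacks one array per class: (n_classes, n_samples, n_features)
print(shap_values.shape)   #(2, 569, 30)
print(X_test.shape)        #(569, 30)
The last two axes of shap_values match X_test exactly; the leading axis of length 2 indexes the predicted class.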
Then, asking the question "what, on average, are the most influential features judged by Shapley contributions?", you'll get:
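The plotting call is missing from the scraped answer; presumably it was shap's bar-style summary plot. A sketch, using the class-1 SHAP array as in the question's own commented-out summary_plot call:
#mean |SHAP value| per feature for class 1, drawn as a bar chart
shap.summary_plot(shap_values[1], X_test, plot_type="bar")
(bar plot of mean |SHAP value| per feature: https://i.sstatic.net/kczng.png)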
Same as:
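The final code block is also missing; the bar plot's ranking can be reproduced as a plain table, which is also the fix the question needs: average |SHAP| over the samples axis of a single class's array, so one number per feature remains, aligned with the 30 column names. A sketch under that assumption:
#one mean |SHAP| value per feature for class 1, aligned with the column names
importance_df = pd.DataFrame({
    'column_name': X_test.columns,
    'shap_importance': np.abs(shap_values[1]).mean(axis=0)
}).sort_values('shap_importance', ascending=False)
print(importance_df)
The question's version produced None names because np.abs(shap_values).mean(axis=0) averaged over the class axis rather than the samples axis, leaving a (569, 30) array; pairing its 569 rows with only 30 column names padded the name column with None.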