Getting summary SHAP plots for a catboost model with tidymodels in R
I am trying to build a catboost model within the tidymodels framework. A minimal reproducible example is given below. I can use the DALEX and modelStudio packages to get model explanations, but I want to create variable importance (VIP) plots and SHAP summary plots for this catboost model. I have tried packages such as fastshap and SHAPforxgboost without any luck. I realise that I have to extract the variable importance and SHAP values from the model object and use them to produce these plots, but I don't know how to do that. Is there a way to get this done in R?
library(tidymodels)
library(treesnip)
library(catboost)
library(modelStudio)
library(DALEXtra)
library(DALEX)
data <- structure(list(
  Age = c(74, 60, 57, 53, 72, 72, 71, 77, 50, 66),
  StatusofNation0developed = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L),
    .Label = c("0", "1"), class = "factor"),
  treatment = structure(c(2L, 1L, 2L, 2L, 2L, 1L, 1L, 3L, 1L, 2L),
    .Label = c("0", "1", "2"), class = "factor"),
  InHospitalMortalityMortality = c(0, 0, 1, 1, 1, 0, 0, 1, 1, 0)),
  row.names = c(NA, 10L), class = "data.frame")
split <- initial_split(data, strata = InHospitalMortalityMortality)
train <- training(split)
test <- testing(split)
train$InHospitalMortalityMortality <- as.factor(train$InHospitalMortalityMortality)
rec <- recipe(InHospitalMortalityMortality ~ ., data = train)
clf <- boost_tree() %>%
  set_engine("catboost") %>%
  set_mode("classification")
wflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(clf)
model <- wflow %>% fit(data = train)
explainer <- explain_tidymodels(model,
                                data = test,
                                y = test$InHospitalMortalityMortality,
                                label = "catboost")
new_observation <- test[1:2, ]
modelStudio(explainer, new_observation)
Comments (2)
The link above provides an answer, but it is incomplete. Here it is completed, following an identical workflow.
As indicated: first, install the R packages {fastshap} and {reticulate}. Next, set up a virtual environment for Python use with {reticulate}. Setting up a virtual environment is relatively straightforward when using RStudio. Please check their reference material for step-by-step instructions.
Then, pip install {shap} and {matplotlib} in the venv. Note that matplotlib 3.2.2 appears to be necessary for summary plots (see the GitHub issues for more detail).
The workflow (from treesnip docs):
Fit the workflow:
With a fitted workflow, we can now create SHAP values via {fastshap} and plot them with {fastshap} and {reticulate}.
First, the force plots: to do this, we need to create a prediction function for the pred_wrapper argument.
Now we want the mean prediction values for the baseline argument.
Here, create the shap values:
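A minimal sketch of the three steps above (prediction wrapper, baseline, SHAP values), written against the question's objects and not the treesnip example. It assumes `model` is the fitted workflow and `train` the training data from the question, that the outcome levels are "0" and "1" (so the probability column is `.pred_1`), and that the installed {fastshap} version exposes `explain()` with the `X`, `pred_wrapper` and `nsim` arguments. The names `pred_fun`, `X` and `baseline` are only illustrative.

```r
library(tidymodels)
library(fastshap)

# Assumption: `model` and `train` are the objects created in the question.
outcome <- "InHospitalMortalityMortality"
X <- as.data.frame(train[, setdiff(names(train), outcome), drop = FALSE])

# fastshap calls pred_wrapper(object, newdata) and expects a numeric vector.
# Here we return the predicted probability of class "1".
pred_fun <- function(object, newdata) {
  predict(object, new_data = newdata, type = "prob")$.pred_1
}

# Baseline: the mean prediction over the background data.
baseline <- mean(pred_fun(model, X))

# Monte Carlo approximation of SHAP values; a larger nsim is slower
# but gives more stable estimates.
set.seed(2021)
shap <- fastshap::explain(model, X = X, pred_wrapper = pred_fun, nsim = 50)
```

The result has one row per observation in `X` and one column per feature, which is the shape the plotting steps expect.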
Now, for the force plot:
Now for the summary plot: use {reticulate} to access function directly:
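One way this could look, as a sketch only: it assumes `shap` and `X` are the SHAP values and feature data from the {fastshap} step, and that the Python environment active in {reticulate} has shap and matplotlib installed. The output file name is arbitrary.

```r
library(reticulate)

# Assumption: the active Python environment provides these two modules.
shap_py <- import("shap")
plt <- import("matplotlib.pyplot")

shap_py$summary_plot(
  data.matrix(shap),             # SHAP values as a numeric matrix
  data.matrix(X),                # feature values; factors become integer codes
  feature_names = colnames(X),
  show = FALSE                   # keep the figure open so it can be saved
)

# Save the figure to disk and close it before drawing another plot.
plt$savefig("shap_summary.png", bbox_inches = "tight")
plt$close()
```

Saving to a file and closing the figure after each plot is one way to keep successive plots from drawing on top of each other.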
The same would work for dependence plots, for example.
Final note: repeated rendering will result in buggy visualizations. Naming a feature directly (e.g., "cut") in dependence_plot threw an error for me.
First, extract the fitted workflow from the model object and (optionally) use it to predict on the test set. Then create a pool object with the catboost.load_pool function. After this, the catboost.get_feature_importance function returns the feature importance scores for the model object. The same function with the type = 'ShapValues' option returns the SHAP values. Finally, plot the SHAP values.
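The steps above can be sketched roughly as follows. This is not the answerer's original code: it assumes `model` is the fitted workflow and `test` the held-out data from the question, a recent {workflows} version that provides `extract_fit_engine()` and `extract_recipe()`, and a binary classifier (so the SHAP result is a matrix whose last column is the expected value). The helper `rescale01` and the plot layout are illustrative choices.

```r
library(tidymodels)
library(catboost)

# Assumption: `model` and `test` are the objects created in the question.
engine_fit <- extract_fit_engine(model)   # underlying catboost model
x_test <- as.data.frame(
  bake(extract_recipe(model), new_data = test, all_predictors())
)
pool <- catboost.load_pool(x_test)        # factor columns become categorical

# Feature importance scores, one per predictor.
imp <- catboost.get_feature_importance(engine_fit, pool = pool,
                                       type = "FeatureImportance")
imp_df <- data.frame(feature = colnames(x_test), importance = as.numeric(imp))

# SHAP values: n_obs x (n_features + 1); drop the expected-value column.
shap_raw <- catboost.get_feature_importance(engine_fit, pool = pool,
                                            type = "ShapValues")
shap_vals <- shap_raw[, seq_len(ncol(x_test)), drop = FALSE]
colnames(shap_vals) <- colnames(x_test)

# Rescale each feature to [0, 1] so one colour scale serves all features.
rescale01 <- function(v) {
  r <- range(v, na.rm = TRUE)
  if (r[1] == r[2]) rep(0.5, length(v)) else (v - r[1]) / (r[2] - r[1])
}

# Long format: as.vector() walks the matrix column by column, which matches
# rep(..., each = nrow).
shap_long <- data.frame(
  feature = rep(colnames(shap_vals), each = nrow(shap_vals)),
  shap    = as.vector(shap_vals),
  value   = as.vector(apply(data.matrix(x_test), 2, rescale01))
)

# VIP-style bar chart.
vip_plot <- ggplot(imp_df, aes(x = importance, y = reorder(feature, importance))) +
  geom_col() +
  labs(x = "Importance", y = NULL)

# Summary plot: one point per observation and feature, ordered by mean |SHAP|.
summary_plot <- ggplot(shap_long,
                       aes(x = shap,
                           y = reorder(feature, abs(shap), FUN = mean),
                           colour = value)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_jitter(height = 0.2, width = 0) +
  scale_colour_gradient(low = "blue", high = "red") +
  labs(x = "SHAP value", y = NULL, colour = "Feature value\n(rescaled)")

print(vip_plot)
print(summary_plot)
```

With only a few test rows, as in the question's toy data, the summary plot will be sparse. Building the pool from the training predictors instead gives a fuller picture.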