DALEX and step_pca



I would like to look at the compound feature importance of the principal components with DALEX model_parts but I am also interested to what extent the results are driven by variation in a specific variable in this principal component. I can look at individual feature influence very neatly when using model_profile but in that case, I cannot investigate the feature importance of the PCA variables. Is it possible to get the best of both world and look at the compound feature importance of a principal component while using model_profile partial dependence plots of individual factors as shown below?

Data:

library(tidymodels)
library(parsnip)
library(DALEXtra)

set.seed(1)
x1 <- rbinom(1000, 5, .1)
x2 <- rbinom(1000, 5, .4)
x3 <- rbinom(1000, 5, .9)
x4 <- rbinom(1000, 5, .6)
id <- c(1:1000)
y <- as.factor(rbinom(1000, 5, .5))
df <- tibble(y, x1, x2, x3, x4, id)
df[, c("x1", "x2", "x3", "x4", "id")] <- sapply(df[, c("x1", "x2", "x3", "x4", "id")], as.numeric)

Model

# create training and test set
set.seed(20)
split_dat <- initial_split(df, prop = 0.8)
train <- training(split_dat)
test <- testing(split_dat)
# use cross-validation
kfolds <- vfold_cv(df)

# recipe
rec_pca <- recipe(y ~ ., data = train) %>%
  update_role(id, new_role = "id variable") %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(x1, x2, x3, threshold = 0.9)

# parsnip engine
boost_model <- boost_tree() %>% 
  set_mode("classification") %>% 
  set_engine("xgboost")

# create wf
boosted_wf <- 
  workflow() %>% 
  add_model(boost_model) %>% 
  add_recipe(rec_pca)

final_boosted <- generics::fit(boosted_wf, df) 

# create an explanation object
explainer_xgb <- DALEXtra::explain_tidymodels(final_boosted, 
                                              data = df[,-1], 
                                              y = df$y) 

# feature importance
model_parts(explainer_xgb) %>% plot()

This gives me the plot below, even though I have reduced x1, x2 and x3 into one component with step_pca above. (That is because explain_tidymodels works on the original predictors and lets the workflow's recipe run the PCA internally, so DALEX only ever sees x1, x2, x3 and x4.)

[Plot: permutation feature importance showing x1, x2, x3 and x4 as separate features]

I know that I could reduce the dimensions manually, bind the component onto df like so, and then look at the feature importance.

rec_pca_2 <- df %>% 
  select(x1, x2, x3) %>% 
  recipe() %>%
  step_pca(all_numeric(), num_comp = 1)


df <- bind_cols(df, prep(rec_pca_2) %>% juice())
df

> df
# A tibble: 1,000 × 6
   y        x1    x2    x3    x4   PC1
   <fct> <int> <int> <int> <int> <dbl>
 1 2         0     2     4     2 -4.45
 2 3         0     3     3     3 -3.95
 3 0         0     2     4     4 -4.45
 4 2         1     4     5     3 -6.27
 5 4         0     1     5     2 -4.94
 6 2         1     0     5     1 -4.63
 7 3         2     2     5     4 -5.56
 8 3         1     2     5     3 -5.45
 9 2         1     3     5     2 -5.86
10 2         0     2     5     1 -5.35
# … with 990 more rows

I could then estimate a model with PC1 as a covariate. Yet, in that case, it would be difficult to interpret what substantial variation in PC1 means when using model_profile, since everything would be collapsed into one component.
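For illustration, a minimal sketch of such a refit (the names rec_pca_manual, boosted_wf_manual and final_manual are only placeholders; it reuses boost_model and the df with the bound PC1 column from above):

# hypothetical refit that treats PC1 as an ordinary covariate
rec_pca_manual <- recipe(y ~ PC1 + x4, data = df)

boosted_wf_manual <- workflow() %>%
  add_model(boost_model) %>%
  add_recipe(rec_pca_manual)

final_manual <- generics::fit(boosted_wf_manual, df)

By contrast, the partial dependence profiles from the original explainer look like this: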

model_profile(explainer_xgb) %>% plot()

[Plot: partial dependence profiles for the individual predictors]

Thus, my key question is: how can I look at the feature importance of components without compromising the interpretability of the partial dependence plots?


1 Answer

顾挽 2025-01-18 08:58:50


You may be interested in the discussion here on how to get explainability from the original predictors vs. features that have been created via feature engineering (like PCA components). We don't have a super fluent interface yet, so you have to do this a bit manually:

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(parsnip)
library(DALEX)
#> Welcome to DALEX (version: 2.4.0).
#> Find examples and detailed introduction at: http://ema.drwhy.ai/
#> 
#> Attaching package: 'DALEX'
#> The following object is masked from 'package:dplyr':
#> 
#>     explain

set.seed(1)
x1 <- rbinom(1000, 5, .1)
x2 <- rbinom(1000, 5, .4)
x3 <- rbinom(1000, 5, .9)
x4 <- rbinom(1000, 5, .6)
y <- as.factor(sample(c("yes", "no"), size = 1000, replace = TRUE))
df <- tibble(y, x1, x2, x3, x4) %>% mutate(across(where(is.integer), as.numeric))

# create training and test set
set.seed(20)
split_dat <- initial_split(df, prop = 0.8)
train <- training(split_dat)
test <- testing(split_dat)
# use cross-validation
kfolds <- vfold_cv(df)

# recipe
rec_pca <- recipe(y ~ ., data = train) %>%
    step_center(all_predictors()) %>%
    step_scale(all_predictors()) %>%
    step_pca(x1, x2, x3, threshold = 0.9)

# parsnip engine
boost_model <- boost_tree() %>% 
    set_mode("classification") %>% 
    set_engine("xgboost")

# create wf
boosted_wf <- 
    workflow() %>% 
    add_model(boost_model) %>% 
    add_recipe(rec_pca)

final_boosted <- generics::fit(boosted_wf, df) 
#> [14:00:11] WARNING: amalgamation/../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

Notice that next here I use regular DALEX (not DALEXtra), and that I manually extract out the xgboost model from inside the workflow and apply the feature engineering to the data myself:

# create an explanation object
explainer_xgb <-
    DALEX::explain(
        extract_fit_parsnip(final_boosted), 
        data = rec_pca %>% prep() %>% bake(new_data = NULL, all_predictors()), 
        y = as.integer(train$y)
    ) 
#> Preparation of a new explainer is initiated
#>   -> model label       :  model_fit  (  default  )
#>   -> data              :  800  rows  4  cols 
#>   -> data              :  tibble converted into a data.frame 
#>   -> target variable   :  800  values 
#>   -> predict function  :  yhat.model_fit  will be used (  default  )
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package parsnip , ver. 0.1.7 , task classification (  default  ) 
#>   -> predicted values  :  numerical, min =  0.1157353 , mean =  0.4626758 , max =  0.8343955  
#>   -> residual function :  difference between y and yhat (  default  )
#>   -> residuals         :  numerical, min =  0.1860582 , mean =  0.9985742 , max =  1.884265  
#>   A new explainer has been created!


model_parts(explainer_xgb) %>% plot()

Created on 2022-03-11 by the reprex package (v2.0.1)

The only behavior supported right now in DALEXtra is based on using the original predictors, so if you want to look at those engineered features, you need to do it yourself. You may be interested in this chapter of our book.
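If you want both views at once, one workable pattern is to keep two explainers for the same fitted workflow: the feature-space explainer above for model_parts, plus an original-predictor explainer via DALEXtra for model_profile. A minimal sketch, assuming the objects defined above (explainer_orig is just a name for illustration, and DALEXtra is assumed to be installed):

# compound importance of PC1, via the feature-space explainer from above
model_parts(explainer_xgb) %>% plot()

# per-variable partial dependence, via an explainer on the raw predictors
explainer_orig <- DALEXtra::explain_tidymodels(
    final_boosted,
    data = train %>% select(-y),
    y = as.integer(train$y)
)
model_profile(explainer_orig) %>% plot()

That way each plot lives in the space where it is most interpretable: the importance plot treats PC1 as a single feature, while the profiles stay on the raw x's.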
