Target encoding with embed in the tidymodels framework

Posted on 2025-01-09 01:07:18

I would like to do target encoding for a categorical variable with too many levels.

I have seen this vignette, which proposes the following approaches to target encode a variable:

step_lencode_glm()
step_lencode_bayes() 
step_lencode_mixed()

The three approaches use all the records to create the estimates, which tends to overfit to that column.

Using tidymodels, is there an easy way to split my training set into 5 folds and get the target encoding for each fold from the other 4 folds?

Thanks

Comments (1)

姐不稀罕 2025-01-16 01:07:18

That is exactly what will happen if you use a function like fit_resamples(): the recipe, including the encoding step, is estimated on n - 1 folds, and performance is evaluated on the remaining held-out fold.

If you want to explore this in more detail, you can follow along with this vignette.

library(tidymodels)
library(embed)

data(grants, package = "modeldata")

set.seed(1)
folds <- vfold_cv(grants_other, v = 3)
folds
#> #  3-fold cross-validation 
#> # A tibble: 3 × 2
#>   splits              id   
#>   <list>              <chr>
#> 1 <split [5460/2730]> Fold1
#> 2 <split [5460/2730]> Fold2
#> 3 <split [5460/2730]> Fold3

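# Recipe with a single step: GLM-based likelihood (target) encoding of sponsor_code
rec <- 
  recipe(class ~ sponsor_code, data = grants_other) %>%
  step_lencode_glm(sponsor_code, outcome = vars(class))

# prepper() estimates the recipe on each split's analysis set only;
# tidy(number = 1) extracts the per-level encodings from the first step
res <-
  folds %>%
  mutate(recipe = map(splits, prepper, recipe = rec),
         processed = map(recipe, tidy, number = 1))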

res %>%
  select(fold_id = id, processed) %>%
  unnest(processed)
#> # A tibble: 757 × 5
#>    fold_id level   value terms        id               
#>    <chr>   <chr>   <dbl> <chr>        <chr>            
#>  1 Fold1   100D    0.288 sponsor_code lencode_glm_gfHLA
#>  2 Fold1   101A   -1.50  sponsor_code lencode_glm_gfHLA
#>  3 Fold1   103C   -1.95  sponsor_code lencode_glm_gfHLA
#>  4 Fold1   105A   -1.39  sponsor_code lencode_glm_gfHLA
#>  5 Fold1   107C   16.6   sponsor_code lencode_glm_gfHLA
#>  6 Fold1   10B    16.6   sponsor_code lencode_glm_gfHLA
#>  7 Fold1   111C  -16.6   sponsor_code lencode_glm_gfHLA
#>  8 Fold1   112D    0.560 sponsor_code lencode_glm_gfHLA
#>  9 Fold1   113A    0.223 sponsor_code lencode_glm_gfHLA
#> 10 Fold1   118B    0     sponsor_code lencode_glm_gfHLA
#> # … with 747 more rows

Created on 2022-02-22 by the reprex package (v2.0.1)
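
If what you want is the encoded value for each held-out row (rather than the per-level table above), one possible sketch is to bake each prepped recipe on its own assessment set. The names `oof`, `prepped` and `encoded` are made up for this illustration, and this is an assumption about what the question is asking for, not something taken from the vignette:

```r
library(tidymodels)
library(embed)

data(grants, package = "modeldata")

set.seed(1)
folds <- vfold_cv(grants_other, v = 3)

rec <-
  recipe(class ~ sponsor_code, data = grants_other) %>%
  step_lencode_glm(sponsor_code, outcome = vars(class))

# For each split: estimate the encoding on the analysis rows only,
# then apply it to that split's held-out assessment rows.
oof <-
  folds %>%
  mutate(
    prepped = map(splits, prepper, recipe = rec),
    encoded = map2(prepped, splits, ~ bake(.x, new_data = assessment(.y)))
  ) %>%
  select(fold_id = id, encoded) %>%
  unnest(encoded)

# sponsor_code is now numeric; each training row appears exactly once
oof
```

Every row of `grants_other` lands in exactly one assessment set, so `oof` has one out-of-fold encoded value per training row.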

We would recommend resampling like this to estimate the performance of an embedding strategy, and then using the whole training set to fit the final embedding.
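
As a rough sketch of that recommendation, assuming a plain logistic regression as the downstream model (the model choice and the names `wf`, `cv_res` and `final_fit` are only for illustration), the recipe can be bundled into a workflow so that fit_resamples() re-estimates the encoding inside every fold:

```r
library(tidymodels)
library(embed)

data(grants, package = "modeldata")

set.seed(1)
folds <- vfold_cv(grants_other, v = 5)

rec <-
  recipe(class ~ sponsor_code, data = grants_other) %>%
  step_lencode_glm(sponsor_code, outcome = vars(class))

# Bundle preprocessing and model so they are always fitted together
wf <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_reg())

# The encoding is estimated on 4 folds and evaluated on the 5th, five times
cv_res <- fit_resamples(wf, resamples = folds)
collect_metrics(cv_res)

# Final fit: the encoding and the model both use the whole training set
final_fit <- fit(wf, data = grants_other)
```

The resampled metrics describe how well the encoding strategy generalizes; `final_fit` is the object you would then use to predict on new data.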
