tidymodels recipes: can I use step_dummy() to one-hot encode categorical variables *except* booleans, which only need 1 dummy?

Posted on 2025-01-09 09:42:17

If a categorical variable has more than 2 levels (for example, marital status = single/married/widowed/separated/divorced), then I need to create N dummies, one for each possible level. This is done using step_dummy(one_hot = TRUE).

However, if the category is binary (pokemon_fan = "yes"/"no") then I only need to create a single dummy called "pokemon_fan_yes". This is done using step_dummy(one_hot = FALSE).

Is it possible for step_dummy to count the number of levels and proceed differently depending on that number?

Thanks.

Comments (1)

悲喜皆因你 2025-01-16 09:42:17

There is no automatic way to do this within recipes itself, but I think you can create a function that will handle this for you, something like this:

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

data(crickets, package = "modeldata")

levels_more_than <- function(vec, num = 2) {
  n_distinct(levels(vec)) > num
}

recipe(~ ., data = crickets) %>%
  step_dummy(species, one_hot = !! levels_more_than(crickets$species)) %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 31 × 3
#>     temp  rate species_O..niveus
#>    <dbl> <dbl>             <dbl>
#>  1  20.8  67.9                 0
#>  2  20.8  65.1                 0
#>  3  24    77.3                 0
#>  4  24    78.7                 0
#>  5  24    79.4                 0
#>  6  24    80.4                 0
#>  7  26.2  85.8                 0
#>  8  26.2  86.6                 0
#>  9  26.2  87.5                 0
#> 10  26.2  89.1                 0
#> # … with 21 more rows

recipe(~ ., data = iris) %>%
  step_dummy(Species, one_hot = !! levels_more_than(iris$Species)) %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 150 × 7
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species_setosa
#>           <dbl>       <dbl>        <dbl>       <dbl>          <dbl>
#>  1          5.1         3.5          1.4         0.2              1
#>  2          4.9         3            1.4         0.2              1
#>  3          4.7         3.2          1.3         0.2              1
#>  4          4.6         3.1          1.5         0.2              1
#>  5          5           3.6          1.4         0.2              1
#>  6          5.4         3.9          1.7         0.4              1
#>  7          4.6         3.4          1.4         0.3              1
#>  8          5           3.4          1.5         0.2              1
#>  9          4.4         2.9          1.4         0.2              1
#> 10          4.9         3.1          1.5         0.1              1
#> # … with 140 more rows, and 2 more variables: Species_versicolor <dbl>,
#> #   Species_virginica <dbl>

Created on 2022-02-23 by the reprex package (v2.0.1)

Here are some tips for using not-quite-standard selectors in recipes.
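
When a data set mixes binary and multi-level factors, the same idea can be extended by computing the column names up front and using two step_dummy() calls. The following is only a minimal sketch: the `survey` data frame and the helper `factor_cols_where()` are hypothetical names made up for illustration, and it assumes a recipes version in which `all_of()` is accepted as a selector.

```r
library(recipes)

# Hypothetical toy data mirroring the question: one 5-level factor, one binary factor
survey <- data.frame(
  age = c(23, 35, 41, 58, 62, 30),
  marital_status = factor(
    c("single", "married", "widowed", "separated", "divorced", "married"),
    levels = c("single", "married", "widowed", "separated", "divorced")
  ),
  pokemon_fan = factor(c("yes", "no", "no", "yes", "no", "yes"), levels = c("no", "yes"))
)

# Hypothetical helper: names of factor columns whose level count satisfies `predicate`
factor_cols_where <- function(data, predicate) {
  is_match <- vapply(
    data,
    function(col) is.factor(col) && predicate(nlevels(col)),
    logical(1)
  )
  names(data)[is_match]
}

multi_cols  <- factor_cols_where(survey, function(n) n > 2)
binary_cols <- factor_cols_where(survey, function(n) n == 2)

baked <- recipe(~ ., data = survey) %>%
  # One dummy per level for factors with more than 2 levels
  step_dummy(all_of(multi_cols), one_hot = TRUE) %>%
  # A single dummy (reference level dropped) for binary factors
  step_dummy(all_of(binary_cols), one_hot = FALSE) %>%
  prep() %>%
  bake(new_data = NULL)

names(baked)
```

Because the column names are computed from the data before the recipe is defined, they need to be recomputed if the training data changes.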
