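What is the best way to convert multiple integer columns to factors in R?
This works for a toy example, but I think there must be a better way (faster, using less memory) to do this for larger data frames. Any suggestions appreciated!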
library(tidyverse)
library(tictoc)
# Lookup tables: one row per (variable, integer code) with its label
cyl <- tibble(integer_value = unique(mtcars$cyl),
              as_a_string = paste(unique(mtcars$cyl), " cylinders")) %>%
  mutate(variable = "cyl")
gear <- tibble(integer_value = unique(mtcars$gear),
               as_a_string = paste(unique(mtcars$gear), " gears")) %>%
  mutate(variable = "gear")
carb <- tibble(integer_value = unique(mtcars$carb),
               as_a_string = paste(unique(mtcars$carb), " carburetors")) %>%
  mutate(variable = "carb")
# vs: 0 = V shaped, 1 = straight
vs <- tibble(integer_value = c(0, 1),
             as_a_string = c("V shaped", "straight")) %>%
  mutate(variable = "vs")
# am: 0 = automatic, 1 = manual
am <- tibble(integer_value = c(0, 1),
             as_a_string = c("Automatic", "Manual")) %>%
  mutate(variable = "am")
factor_info <- rbind(cyl, gear, carb, vs, am) %>%
  select(variable, everything())
df <- mtcars
tic()
for (var in unique(factor_info$variable)) {
  # Integer codes of this variable, tagged with the variable name
  col <- mtcars %>%
    select(all_of(var)) %>%
    mutate(variable = var) %>%
    rename(integer_value = all_of(var))
  # Labels for this variable only
  fac <- factor_info %>%
    filter(variable == var)
  # Join codes to labels and keep the label column
  df[[var]] <- inner_join(col, fac) %>%
    pull(as_a_string)
}
#> Joining, by = c("integer_value", "variable")
#> Joining, by = c("integer_value", "variable")
#> Joining, by = c("integer_value", "variable")
#> Joining, by = c("integer_value", "variable")
#> Joining, by = c("integer_value", "variable")
df <- df %>%
  as_tibble() %>%
  mutate(across(where(is.character), factor))
toc()
#> 0.172 sec elapsed
Created on 2022-02-25 by the reprex package (v2.0.1)
Comments (3)
I don't have enough rep to add a comment. Here is another variant to put under TarJae's answer, using purrr::modify_if.
Created on 2022-02-26 by the reprex package (v2.0.1)
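As a minimal sketch of what a purrr::modify_if approach could look like (not necessarily the answerer's code): modify_if applies a function to every column that satisfies a predicate and returns an object of the same type as its input. The predicate is_categorical and its threshold of 10 distinct values are assumptions made for this illustration.

```r
library(purrr)

# Hypothetical predicate: treat a column as categorical if it is numeric,
# whole-valued, and has at most 10 distinct values.
is_categorical <- function(x) {
  is.numeric(x) && all(x == round(x)) && length(unique(x)) <= 10
}

# Convert only the matching columns; all other columns are left untouched.
df <- modify_if(mtcars, is_categorical, as.factor)

# Columns that were converted
names(df)[vapply(df, is.factor, logical(1))]
```

This only converts the columns to factors with their numeric codes as levels; attaching descriptive labels would be a separate step.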
New answer (I deleted the first one). I think you need fct_relabel from the forcats package. Elapsed time: 0.04 sec.
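A minimal sketch of how fct_relabel might be used here, not necessarily the code from this answer: fct_relabel takes a factor and a function that maps the vector of existing levels to new level names. The suffix lookup vector is an assumption introduced for this illustration.

```r
library(forcats)

# Hypothetical lookup: the text appended to each level, per column
suffix <- c(cyl = "cylinders", gear = "gears", carb = "carburetors")

df <- mtcars
for (nm in names(suffix)) {
  # factor() creates levels from the numeric codes ("4", "6", "8", ...);
  # fct_relabel() then rewrites each level name via the supplied function.
  df[[nm]] <- fct_relabel(factor(df[[nm]]),
                          function(lv) paste(lv, suffix[[nm]]))
}

levels(df$cyl)
```

Because only the levels are rewritten, the cost of relabeling depends on the number of distinct values rather than on the number of rows.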
as.factor(x) is faster and more efficient than factor(x) when x is of type integer and length(x) is large. The categorical variables in mtcars are integer-valued but stored as double. In this situation, you can first coerce them to integer and then to factor efficiently, and afterwards rename the levels efficiently by assigning to levels().
FWIW, this entire operation takes less than a millisecond on my machine.
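The two steps described above can be sketched in base R as follows. This is an illustration under assumptions, not necessarily the answerer's code: the labels list is a hypothetical lookup from integer codes (as strings) to display names.

```r
# Hypothetical lookup: per column, a named vector mapping code -> label
labels <- list(
  cyl = c("4" = "4 cylinders", "6" = "6 cylinders", "8" = "8 cylinders"),
  vs  = c("0" = "V shaped", "1" = "straight"),
  am  = c("0" = "Automatic", "1" = "Manual")
)

df <- mtcars
for (nm in names(labels)) {
  # Step 1: double -> integer -> factor; as.factor() is efficient for integers
  f <- as.factor(as.integer(df[[nm]]))
  # Step 2: rename levels by looking up each existing level name
  levels(f) <- unname(labels[[nm]][levels(f)])
  df[[nm]] <- f
}

str(df[names(labels)])
```

Renaming through levels() touches only the short vector of level names, so no join over the full data is needed.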