Dummy coding syntax (a one-hot encoding question)
I have sample data that looks like this:
id <- c("1a","2c","3d","4f","5g","6e","7f","8q","9r","10v","11x","12l")
O <- c(1,1,0,1,1,0,0,1,0,1,0,1)
dg1 <- c("A02","A84","B12","C94","D37","D12","D68","E12","F48","H12","Z83","")
dg2 <- c("B18","N34","A02","M01","B12","J02","K52","","I10","","","B18")
df <- cbind.data.frame(id,O,dg1,dg2)
I am trying to get a data frame that looks like this so that I can do a univariate logistic regression on O against each variable.
A02 <- c(1,0,1,0,0,0,0,0,0,0,0,0)
A84 <- c(0,1,0,0,0,0,0,0,0,0,0,0)
B12 <- c(0,0,1,0,1,0,0,0,0,0,0,0)
B18 <- c(1,0,0,0,0,0,0,0,0,0,0,1)
C94 <- c(0,0,0,1,0,0,0,0,0,0,0,0)
D12 <- c(0,0,0,0,0,1,0,0,0,0,0,0)
D37 <- c(0,0,0,0,1,0,0,0,0,0,0,0)
D68 <- c(0,0,0,0,0,0,1,0,0,0,0,0)
E12 <- c(0,0,0,0,0,0,0,1,0,0,0,0)
F48 <- c(0,0,0,0,0,0,0,0,1,0,0,0)
H12 <- c(0,0,0,0,0,0,0,0,0,1,0,0)
I10 <- c(0,0,0,0,0,0,0,0,1,0,0,0)
J02 <- c(0,0,0,0,0,1,0,0,0,0,0,0)
K52 <- c(0,0,0,0,0,0,1,0,0,0,0,0)
M01 <- c(0,0,0,1,0,0,0,0,0,0,0,0)
N34 <- c(0,1,0,0,0,0,0,0,0,0,0,0)
Z83 <- c(0,0,0,0,0,0,0,0,0,0,1,0)
df <- cbind.data.frame(df,A02,A84,B12,B18,C94,D12,D37,D68,E12,F48,H12,I10,J02,K52,M01,N34,Z83)
I've attempted to follow the code here and here but ran into issues that I wasn't sure how to fix. Can anyone point out my mistake/misunderstanding? I would prefer to have a solution in dplyr or base, but really willing to try anything.
Attempts:
dumbo <- model.matrix(id ~ dg1+dg2,df)
library(recipes)
dumber <- df %>%
  recipe(id ~ .) %>%
  step_dummy(dg1:dg2, one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL)
Comments (3)
I have a package {dplyover} on GitHub which helps to solve this kind of problem without data rectangling (pivoting). To make it work we first need to convert the empty cells "" into NAs. Then we can use dplyover::dist_values to get the unique values without NAs and loop over them to create new columns. We need to do this rowwise, since the values can be in either dg1 or dg2.
Created on 2022-03-24 by the reprex package (v0.3.0)
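The steps described in this answer (turn "" into NA, collect the distinct codes, then create one 0/1 column per code by checking both dg1 and dg2) can also be sketched in base R. This is not the {dplyover} code from the answer, only a minimal illustration of the same idea using the question's sample data; the names codes, dummies and result are made up for this example.

```r
# Sample data from the question
id  <- c("1a","2c","3d","4f","5g","6e","7f","8q","9r","10v","11x","12l")
O   <- c(1,1,0,1,1,0,0,1,0,1,0,1)
dg1 <- c("A02","A84","B12","C94","D37","D12","D68","E12","F48","H12","Z83","")
dg2 <- c("B18","N34","A02","M01","B12","J02","K52","","I10","","","B18")
df  <- data.frame(id, O, dg1, dg2, stringsAsFactors = FALSE)

# Step 1: treat empty strings as missing values
df$dg1[df$dg1 == ""] <- NA
df$dg2[df$dg2 == ""] <- NA

# Step 2: distinct, sorted diagnosis codes across both columns, NA excluded
all_codes <- c(df$dg1, df$dg2)
codes <- sort(unique(all_codes[!is.na(all_codes)]))

# Step 3: one 0/1 column per code; a row gets 1 if the code appears in
# either dg1 or dg2 (%in% returns FALSE for NA, so NA never matches)
dummies <- sapply(codes, function(code) {
  as.integer(df$dg1 %in% code | df$dg2 %in% code)
})

# Step 4: attach the indicator columns to the original data
result <- cbind(df, dummies)
```

Because each indicator is computed from both columns at once, a code such as A02 that appears in dg1 for one row and in dg2 for another ends up in a single column, which is the layout the question asks for. Each column can then be used as the sole predictor in a model such as glm(O ~ A02, data = result, family = binomial), although with only 12 rows such fits are likely to produce separation warnings.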