Join to many: combining related features
I have a dataframe where each row represents a spatial unit. The nbid* variables indicate which unit is a neighbour. I would like to get the dum variable of the neighbour into the main dataframe. (Instead of spatial units it could be any kind of relations within a dataframe - business partners, relatives, related genes etc.)
Some simplified data look like this:
library(dplyr)

set.seed(999)
df_base <- data.frame(id = 1:100,
                      dum = sample(c(rep(0, 50), rep(1, 50)), 100),
                      nbid_1 = sample(1:100, 100),
                      nbid_2 = sample(1:100, 100),
                      nbid_3 = sample(1:100, 100)) %>%
  # set 10%, 30% and 70% of the neighbour ids to NA (units with fewer neighbours)
  mutate(nbid_1 = replace(nbid_1, sample(row_number(), size = ceiling(0.1 * n()), replace = FALSE), NA),
         nbid_2 = replace(nbid_2, sample(row_number(), size = ceiling(0.3 * n()), replace = FALSE), NA),
         nbid_3 = replace(nbid_3, sample(row_number(), size = ceiling(0.7 * n()), replace = FALSE), NA))
(In these simplified data, unlike in the real data, neighbours 1, 2 and 3 can be the same unit, but that does not matter for the question.)
My approach was to duplicate and then join the data, which would look like this:
df1 <- df_base
# lookup table: one row per unit, carrying only id and the dummy
df2 <- df_base %>%
  select(-c(nbid_1, nbid_2, nbid_3)) %>%
  rename(nbdum = dum)
# one join (plus rename) per neighbour column
df <- left_join(df1, df2, by = c("nbid_1" = "id")) %>%
  rename(nbdum1 = nbdum) %>%
  left_join(., df2, by = c("nbid_2" = "id")) %>%
  rename(nbdum2 = nbdum) %>%
  left_join(., df2, by = c("nbid_3" = "id")) %>%
  rename(nbdum3 = nbdum)
df is the result that I am looking for - from here I can create an overall neighbour dummy or a count.
This approach is however neither elegant nor feasible to implement with the real data which has many more neighbours.
How can I solve this in a less clumsy way?
Thanks in advance for your ideas!!
Comments (2)
A key clue is that when you see var_1, var_2, ..., var_n, it suggests that the data can be reshaped to a longer format. See pivot_longer() or data.table::melt(), where "molten" data is discussed frequently. For your example, we can pivot to long and then join the df2 table back. I am unsure whether the wide format is needed, but after the join we can pivot back to wide with pivot_wider().
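The code that accompanied this answer did not survive on this page, so what follows is only a minimal sketch of the pivot-then-join idea, assuming dplyr and tidyr are installed. The five-row df_base, the lookup table (playing the role of df2) and the column names nb, nbid and nbdum are illustrative assumptions, not the answerer's own code.

```r
library(dplyr)
library(tidyr)

# Small illustrative data set (an assumption, not the data from the question)
df_base <- data.frame(
  id     = 1:5,
  dum    = c(0, 1, 1, 0, 1),
  nbid_1 = c(2L, 3L, NA, 5L, 1L),
  nbid_2 = c(NA, 1L, 4L, 2L, NA)
)

# Lookup table: one row per unit with its dummy, renamed for the join
lookup <- df_base %>% select(id, nbdum = dum)

# One row per (unit, neighbour slot), then a single join for all slots
df_long <- df_base %>%
  pivot_longer(starts_with("nbid_"),
               names_to = "nb", names_prefix = "nbid_",
               values_to = "nbid") %>%
  left_join(lookup, by = c("nbid" = "id"))

# Optional: back to one row per unit, with columns nbid_1, nbdum_1, ...
df_wide <- df_long %>%
  pivot_wider(names_from = nb, values_from = c(nbid, nbdum))

# Counts and an overall dummy are easiest to compute on the long format
nb_summary <- df_long %>%
  group_by(id) %>%
  summarise(nb_count = sum(nbdum, na.rm = TRUE),
            nb_any   = as.integer(any(nbdum == 1, na.rm = TRUE)))
```

Because the join happens once on the long table, the same code should work unchanged however many nbid_* columns the real data has.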
As you just seem to be indexing dum with your neighbour variables, you should be able to do this directly in dplyr, or with the same idea in base R.
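The code for this answer is also missing from the page. A minimal sketch of the indexing idea might look like the following; it is a reconstruction under assumptions, not the answerer's code. It uses match() on id so it does not rely on id being equal to the row number, and the small df_base and the "_dum" suffix are illustrative choices.

```r
library(dplyr)

# Small illustrative data set (an assumption, not the data from the question)
df_base <- data.frame(
  id     = 1:5,
  dum    = c(0, 1, 1, 0, 1),
  nbid_1 = c(2L, 3L, NA, 5L, 1L),
  nbid_2 = c(NA, 1L, 4L, 2L, NA)
)

# dplyr: look up each neighbour's dum by position of the neighbour id in id
df_dplyr <- df_base %>%
  mutate(across(starts_with("nbid_"),
                ~ dum[match(.x, id)],
                .names = "{.col}_dum"))

# Base R: same idea, applied to every nbid_* column
nb_cols <- grep("^nbid_", names(df_base), value = TRUE)
df_baseR <- df_base
df_baseR[paste0(nb_cols, "_dum")] <- lapply(
  df_base[nb_cols],
  function(x) df_base$dum[match(x, df_base$id)]
)

# Neighbour count across all looked-up columns (NA neighbours ignored)
df_baseR$nb_count <- rowSums(df_baseR[paste0(nb_cols, "_dum")], na.rm = TRUE)
```

A missing neighbour id yields NA in the looked-up column, which matches what the left_join() approach in the question produces.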