Join to many: combining related features

Posted on 2025-02-13 08:11:08

I have a dataframe where each row represents a spatial unit. The nbid* variables indicate which unit is a neighbour. I would like to get the dum variable of the neighbour into the main dataframe. (Instead of spatial units it could be any kind of relations within a dataframe - business partners, relatives, related genes etc.)
Some simplified data look like this:

library(dplyr)

set.seed(999)  # set.seed() (not seed()) makes the sampling reproducible
df_base <- data.frame(id = 1:100,
                 dum = sample(c(rep(0, 50), rep(1, 50)), 100),
                 nbid_1 = sample(1:100, 100),
                 nbid_2 = sample(1:100, 100),
                 nbid_3 = sample(1:100, 100)) %>%
  # blank out 10%, 30% and 70% of the neighbour ids so that units have differing numbers of neighbours
  mutate(nbid_1 = replace(nbid_1, sample(row_number(), size = ceiling(0.1 * n()), replace = FALSE), NA),
         nbid_2 = replace(nbid_2, sample(row_number(), size = ceiling(0.3 * n()), replace = FALSE), NA),
         nbid_3 = replace(nbid_3, sample(row_number(), size = ceiling(0.7 * n()), replace = FALSE), NA))

(In these simplified data, unlike in the real data, neighbours 1, 2 and 3 can be the same unit, but that does not matter for the question.)

My approach was to duplicate and then join the data, which would look like this:

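df1 <- df_base
# lookup table: one row per unit, holding only its id and its dum value
df2 <- df_base %>%
  select(-c(nbid_1, nbid_2, nbid_3)) %>%
  rename(nbdum = dum)

# one join (plus a rename) per neighbour column
df <- left_join(df1, df2, by = c("nbid_1" = "id")) %>%
  rename(nbdum1 = nbdum) %>%
  left_join(., df2, by = c("nbid_2" = "id")) %>%
  rename(nbdum2 = nbdum) %>%
  left_join(., df2, by = c("nbid_3" = "id")) %>%
  rename(nbdum3 = nbdum)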

df is the result that I am looking for - from here I can create an overall neighbour dummy or a count.
This approach is however neither elegant nor feasible to implement with the real data which has many more neighbours.

How can I solve this in a less clumsy way?

Thanks in advance for your ideas!!

Comments (2)

夜灵血窟げ 2025-02-20 08:11:08

A key clue is that columns named var_1, var_2, ..., var_n usually mean the data can be reshaped to a longer format. See pivot_longer() or data.table::melt(), where "molten" data is discussed frequently.

For your example, we can pivot to long format and then join the df2 table back on. I am unsure whether you need the wide format, but after the join we can pivot back to wide with pivot_wider().

library(dplyr)
library(tidyr)


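# df1 and df2 are the data frames defined in the question
df1 %>%
  select(!id) %>%
  pivot_longer(cols = starts_with("nbid"), names_prefix = "nbid_") %>%
  # rebuild the row id: 100 units, each expanded to 3 neighbour rows
  mutate(original_id = rep(1:100, each = 3)) %>%
  left_join(df2, by = c("value" = "id")) %>%
  # naming id_cols avoids relying on argument position
  pivot_wider(id_cols = original_id, values_from = c(value, nbdum))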

#> # A tibble: 100 × 7
#>    original_id value_1 value_2 value_3 nbdum_1 nbdum_2 nbdum_3
#>          <int>   <int>   <int>   <int>   <dbl>   <dbl>   <dbl>
#>  1           1      25      90      23       0       0       1
#>  2           2      12      NA      NA       1      NA      NA
#>  3           3      11      40      47       0       0       0
#>  4           4      94      87      NA       0       1      NA
#>  5           5      46      77      NA       1       0      NA
#>  6           6      98      82      NA       1       0      NA
#>  7           7      43      NA      NA       1      NA      NA
#>  8           8      74      NA       7       0      NA       1
#>  9           9      57      NA      NA       1      NA      NA
#> 10          10      49      72      NA       0       0      NA
#> # … with 90 more rows

## compare to original

as_tibble(df)
#> # A tibble: 100 × 8
#>       id   dum nbid_1 nbid_2 nbid_3 nbdum1 nbdum2 nbdum3
#>    <int> <dbl>  <int>  <int>  <int>  <dbl>  <dbl>  <dbl>
#>  1     1     0     25     90     23      0      0      1
#>  2     2     1     12     NA     NA      1     NA     NA
#>  3     3     1     11     40     47      0      0      0
#>  4     4     1     94     87     NA      0      1     NA
#>  5     5     0     46     77     NA      1      0     NA
#>  6     6     1     98     82     NA      1      0     NA
#>  7     7     1     43     NA     NA      1     NA     NA
#>  8     8     0     74     NA      7      0     NA      1
#>  9     9     0     57     NA     NA      1     NA     NA
#> 10    10     0     49     72     NA      0      0     NA
#> # … with 90 more rows
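
If only the overall neighbour dummy or count mentioned in the question is needed, the pivot back to wide could probably be skipped. The following is a minimal sketch on hypothetical toy data (`toy`, `slot`, `nb_count` and `nb_any` are illustrative names, not from the question), assuming a unit without any neighbours should get a count of 0:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 4 units, up to 2 neighbours each
toy <- data.frame(
  id     = 1:4,
  dum    = c(1, 0, 0, 1),
  nbid_1 = c(2L, 3L, NA, 1L),
  nbid_2 = c(4L, NA, NA, 2L)
)

nb_summary <- toy %>%
  # one row per (unit, neighbour slot)
  pivot_longer(starts_with("nbid"), names_to = "slot", values_to = "nbid") %>%
  # look up each neighbour's dum; missing neighbours stay NA
  left_join(select(toy, nbid = id, nbdum = dum), by = "nbid") %>%
  group_by(id) %>%
  summarise(
    nb_count = sum(nbdum, na.rm = TRUE),  # number of neighbours with dum == 1
    nb_any   = as.integer(nb_count > 0)   # overall neighbour dummy
  )

nb_summary
```

Because the summary works on the long table, it does not depend on how many nbid_* columns the real data has.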
好多鱼好多余 2025-02-20 08:11:08

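Since you are effectively just indexing dum by the neighbour ids, you should be able to do the following (note that this relies on id being equal to the row number, as it is in the example data):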

library(dplyr)

# dum[.x] picks, for each row, the dum value at the position given by the neighbour id
df_base %>%
  mutate(across(starts_with("nbid"), ~ dum[.x], .names = "nbdum{1:3}"))

     id dum nbid_1 nbid_2 nbid_3 nbdum1 nbdum2 nbdum3
1     1   0     25     90     23      0      0      1
2     2   1     12     NA     NA      1     NA     NA
3     3   1     11     40     47      0      0      0
4     4   1     94     87     NA      0      1     NA
5     5   0     46     77     NA      1      0     NA
6     6   1     98     82     NA      1      0     NA
7     7   1     43     NA     NA      1     NA     NA
8     8   0     74     NA      7      0     NA      1
9     9   0     57     NA     NA      1     NA     NA
10   10   0     49     72     NA      0      0     NA
...

Or the same idea in base R:

df_base[paste0("nbdum", 1:3)] <- sapply(df_base[startsWith(names(df_base), "nbid")], \(x) df_base$dum[x])
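
Positional indexing only gives the right answer when each id equals its row number. If that does not hold in the real data, a lookup through match() might be a safer variant. This is a sketch in base R on hypothetical toy data (`toy` and `nb_cols` are illustrative names), not something taken from the question:

```r
# Hypothetical toy data: ids are NOT row positions
toy <- data.frame(
  id     = c(10L, 20L, 30L),
  dum    = c(1, 0, 1),
  nbid_1 = c(20L, 30L, NA),
  nbid_2 = c(30L, NA, 10L)
)

# all neighbour columns, however many there are
nb_cols <- grep("^nbid_", names(toy), value = TRUE)

# match() turns each neighbour id into the row position of that id,
# so dum is looked up by id rather than by position; NA ids stay NA
toy[sub("^nbid_", "nbdum", nb_cols)] <- lapply(
  toy[nb_cols],
  function(x) toy$dum[match(x, toy$id)]
)

toy
```

The new column names are derived from the existing nbid_* names, so nothing has to be hard-coded for the number of neighbours.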