R中基于多列的匹配数据框

发布于 2025-01-15 12:23:52 字数 1400 浏览 3 评论 0原文

我有两个像这样的巨大数据集。

df2 中有一种水果，PEACH，由于某种原因从 df1 中丢失了。我想在 df1 中添加缺少的水果。

library(tidyverse)

df1 <- tibble(central_fruit=c("ananas","apple"),
              fruits=c("ananas,anan,anannas",("apple,appl,appless")),
              counts=c("100,10,1","50,20,2"))
df1
#> # A tibble: 2 × 3
#>   central_fruit fruits              counts  
#>   <chr>         <chr>               <chr>   
#> 1 ananas        ananas,anan,anannas 100,10,1
#> 2 apple         apple,appl,appless  50,20,2

df2 <- tibble(fruit=c("ananas","anan","anannas","apple","appl","appless","PEACH"),
              counts=c(100,10,1,50,20,2,1000))
df2
#> # A tibble: 7 × 2
#>   fruit   counts
#>   <chr>    <dbl>
#> 1 ananas     100
#> 2 anan        10
#> 3 anannas      1
#> 4 apple       50
#> 5 appl        20
#> 6 appless      2
#> 7 PEACH     1000

^{由 reprex 包 (v2.0.1)}于 2022 年 3 月 20 日创建

希望我的数据看起来像这样

df1 
   central_fruit fruits              counts  
   <chr>         <chr>               <chr>   
 1 ananas        ananas,anan,anannas 100,10,1
 2 apple         apple,appl,appless  50,20,2
 3 PEACH            NA               1000

任何帮助或建议非常感谢

原文

I have two huge datasets that look like this.

there is one fruit from df2, PEACH, which is missing for any reason from df1.
I want to add in df1 the fruits that are missing.

library(tidyverse)

df1 <- tibble(central_fruit=c("ananas","apple"),
              fruits=c("ananas,anan,anannas",("apple,appl,appless")),
              counts=c("100,10,1","50,20,2"))
df1
#> # A tibble: 2 × 3
#>   central_fruit fruits              counts  
#>   <chr>         <chr>               <chr>   
#> 1 ananas        ananas,anan,anannas 100,10,1
#> 2 apple         apple,appl,appless  50,20,2

df2 <- tibble(fruit=c("ananas","anan","anannas","apple","appl","appless","PEACH"),
              counts=c(100,10,1,50,20,2,1000))
df2
#> # A tibble: 7 × 2
#>   fruit   counts
#>   <chr>    <dbl>
#> 1 ananas     100
#> 2 anan        10
#> 3 anannas      1
#> 4 apple       50
#> 5 appl        20
#> 6 appless      2
#> 7 PEACH     1000

^{Created on 2022-03-20 by the reprex package (v2.0.1)}

I want my data to look like this

df1 
   central_fruit fruits              counts  
   <chr>         <chr>               <chr>   
 1 ananas        ananas,anan,anannas 100,10,1
 2 apple         apple,appl,appless  50,20,2
 3 PEACH            NA               1000

any help or advice are highly appreciated

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

厌倦 2025-01-22 12:23:52

请在下面找到一种可能的 data.table 方法。

Reprex

代码

library(tidyverse) # to read your tibbles
library(data.table)

setDT(df1)
setDT(df2)

df1[df2, on = .(central_fruit = fruit)
    ][, `:=` (counts = fcoalesce(counts, as.character(i.counts)), i.counts = NULL)
      ][central_fruit %chin% c(df1$central_fruit, setdiff(df2$fruit, unlist(strsplit(df1$fruit, ","))))][]

输出

#>    central_fruit              fruits   counts
#> 1:        ananas ananas,anan,anannas 100,10,1
#> 2:         apple  apple,appl,appless  50,20,2
#> 3:         PEACH                <NA>     1000

^{由 reprex 包于 2022 年 3 月 20 日创建(v2.0.1)}

Please find below one possible data.table approach.

Reprex

Code

library(tidyverse) # to read your tibbles
library(data.table)

setDT(df1)
setDT(df2)

df1[df2, on = .(central_fruit = fruit)
    ][, `:=` (counts = fcoalesce(counts, as.character(i.counts)), i.counts = NULL)
      ][central_fruit %chin% c(df1$central_fruit, setdiff(df2$fruit, unlist(strsplit(df1$fruit, ","))))][]

Output

#>    central_fruit              fruits   counts
#> 1:        ananas ananas,anan,anannas 100,10,1
#> 2:         apple  apple,appl,appless  50,20,2
#> 3:         PEACH                <NA>     1000

^{Created on 2022-03-20 by the reprex package (v2.0.1)}

回复收藏 0 原文

贵在坚持 2025-01-22 12:23:52

您可以通过分割逗号 fruits 和 counts 变量来获取长格式的数据集，使用 df2 执行 full_join >，调整 NA 值，并为每个 central_fruit 折叠值。

library(dplyr)
library(tidyr)

df1 %>%
  separate_rows(fruits, counts, convert = TRUE) %>%
  full_join(df2, by = c('fruits' = 'fruit')) %>%
  transmute(central_fruit = ifelse(is.na(central_fruit), fruits, central_fruit), 
            fruits = ifelse(is.na(counts.x), NA, fruits), 
            counts = coalesce(counts.x, counts.y)) %>%
  group_by(central_fruit) %>%
  summarise(across(.fns = toString))

# central_fruit fruits                counts    
#  <chr>         <chr>                 <chr>     
#1 ananas        ananas, anan, anannas 100, 10, 1
#2 apple         apple, appl, appless  50, 20, 2 
#3 PEACH         NA                    1000

You may get the dataset in a long format by splitting on comma fruits and counts variable, do a full_join with df2, adjust the NA values and for each central_fruit collapse the values.

library(dplyr)
library(tidyr)

df1 %>%
  separate_rows(fruits, counts, convert = TRUE) %>%
  full_join(df2, by = c('fruits' = 'fruit')) %>%
  transmute(central_fruit = ifelse(is.na(central_fruit), fruits, central_fruit), 
            fruits = ifelse(is.na(counts.x), NA, fruits), 
            counts = coalesce(counts.x, counts.y)) %>%
  group_by(central_fruit) %>%
  summarise(across(.fns = toString))

# central_fruit fruits                counts    
#  <chr>         <chr>                 <chr>     
#1 ananas        ananas, anan, anannas 100, 10, 1
#2 apple         apple, appl, appless  50, 20, 2 
#3 PEACH         NA                    1000

回复收藏 0 原文

偏爱自由 2025-01-22 12:23:52

您只需获取 df1 中存在的一组水果并使用它们来过滤 df2，然后将它们绑定在一起。

library(tidyverse)

present <- df1$fruits |> 
  str_split(",") |> 
  unlist()

df2 |> 
  rename(central_fruit = fruit) |> 
  filter(! central_fruit %in% present) |> 
  mutate(counts = as.character(counts)) |> 
  bind_rows(df1)
#> # A tibble: 3 × 3
#>   central_fruit counts   fruits             
#>   <chr>         <chr>    <chr>              
#> 1 PEACH         1000     <NA>               
#> 2 ananas        100,10,1 ananas,anan,anannas
#> 3 apple         50,20,2  apple,appl,appless

You can just take the set of fruits present in your df1 and use them to filter df2, then bind them together.

library(tidyverse)

present <- df1$fruits |> 
  str_split(",") |> 
  unlist()

df2 |> 
  rename(central_fruit = fruit) |> 
  filter(! central_fruit %in% present) |> 
  mutate(counts = as.character(counts)) |> 
  bind_rows(df1)
#> # A tibble: 3 × 3
#>   central_fruit counts   fruits             
#>   <chr>         <chr>    <chr>              
#> 1 PEACH         1000     <NA>               
#> 2 ananas        100,10,1 ananas,anan,anannas
#> 3 apple         50,20,2  apple,appl,appless

回复收藏 0 原文

~没有更多了~