加入从R中的两个数据帧重叠范围

发布于 2025-01-28 06:32:27 字数 1578 浏览 1 评论 0原文

注意:这个问题是“重复”的。提供的解决方案在这里没有回答我的问题。他们展示了如何在单个条目落在一个范围内的情况下合并,我正在尝试识别重叠的范围并加入它们。也许我的标题本来可以更好...

我有一个主数据集main_df,带有开始和结束时间(以秒为单位)。我想看看main_df中的时间范围是否属于lookup_df中的范围列表,如果是这样,请从look> lookup_df 中获取值。此外,如果main_df属于两个不同的查找范围,请重复该行,以表示每个值。***

main_df <- tibble(start = c(30,124,161),
                end = c(80,152,185))

lookup_df <- tibble(start = c(34,73,126,141,174,221),
                       end = c(69,123,136,157,189,267),
                       value = c('a','b','b','b','b','a'))

# Do something here to get the following:

> final_df
# A tibble: 4 x 4
  start   end value notes                                      
  <dbl> <dbl> <chr> <chr>                                      
1    30    80 a     ""                                         
2    30    80 b     "Duplicate because it falls within a and b"
3   124   152 b     "Falls within two lookups but both are b"  
4   161   185 b     ""      

方式...

#Not actual code
left_join(main_df, lookup_df, by(some_range_join_function) %>% 
  add_rows(through_some_means)

***编辑:查看我结构问题的 除了要添加新行,我可以翻转如何加入它们...

semi_join(lookup_df, main_df, by(some_range_join_function))

Note: This question was closed as a 'duplicate'. The solutions offered here and here did not answer my question. They showed how to merge when a single entry fell within a range, I'm trying to identify overlapping ranges and joining them. Perhaps my title could have been better...

I have a main data set main_df with a start and end time (in seconds). I would like to see if the time range in main_df falls within a list of ranges in lookup_df, and if so, grab the value from lookup_df. Additionally, if the main_df falls within two different lookup ranges, duplicate the row so each value is represented.***

main_df <- tibble(start = c(30,124,161),
                end = c(80,152,185))

lookup_df <- tibble(start = c(34,73,126,141,174,221),
                       end = c(69,123,136,157,189,267),
                       value = c('a','b','b','b','b','a'))

# Do something here to get the following:

> final_df
# A tibble: 4 x 4
  start   end value notes                                      
  <dbl> <dbl> <chr> <chr>                                      
1    30    80 a     ""                                         
2    30    80 b     "Duplicate because it falls within a and b"
3   124   152 b     "Falls within two lookups but both are b"  
4   161   185 b     ""      

***Edit: Looking at the way I've structured the problem...

#Not actual code
left_join(main_df, lookup_df, by(some_range_join_function) %>% 
  add_rows(through_some_means)

Rather than having to add a new row I could flip how I'm joining them...

semi_join(lookup_df, main_df, by(some_range_join_function))

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

明天过后 2025-02-04 06:32:27

您可以进行一些合乎逻辑的比较,然后处理一个情况,如果所有情况都是'b''a''b',等等。通过这种方式,您很容易添加更多情况,例如'a',一个是'a',更多是'b'您没有在OP中声明。该方法产生null,如果没有匹配的rbind中的匹配。

f <- \(x, y) {
  w <- which((x[1] >= y[, 1] & x[1] <= y[, 2]) | (x[2] >= y[, 1] & x[1] <= y[, 2]))
  if (length(w) > 0) {
    d <- data.frame(t(x), value=cbind(y[w, 3]), notes='')
    if (length(w) >= 2) {
      if (all(d$value == 'b')) {
        d <- d[!duplicated(d$value), ]
        d$notes[1] <- 'both b'
      }
      else {
        d$notes[nrow(d)] <- 'a & b'
      }
    }
    d
  }
}

apply(main_df, 1, f, lookup_df, simplify=F) |> do.call(what=rbind)
#   start end value  notes
# 1    30  80     a       
# 2    30  80     b  a & b
# 3   124 152     b both b
# 4   161 185     b     

数据:

main_df <- structure(list(start = c(2, 30, 124, 161), end = c(1, 80, 152, 
185)), row.names = c(NA, -4L), class = "data.frame")

lookup_df <- structure(list(start = c(34, 73, 126, 141, 174, 221), end = c(69, 
123, 136, 157, 189, 267), value = c("a", "b", "b", "b", "b", 
"a")), row.names = c(NA, -6L), class = "data.frame")

You could do some logical comparisons and then a case handling what shall happen if all are 'b', 'a' and 'b', etc. In this way you easily could add more cases, e.g. both are 'a', one is 'a', more are 'b' which you didn't declare in OP. The approach yields NULL if there are no matches which gets omitted during rbind.

f <- \(x, y) {
  w <- which((x[1] >= y[, 1] & x[1] <= y[, 2]) | (x[2] >= y[, 1] & x[1] <= y[, 2]))
  if (length(w) > 0) {
    d <- data.frame(t(x), value=cbind(y[w, 3]), notes='')
    if (length(w) >= 2) {
      if (all(d$value == 'b')) {
        d <- d[!duplicated(d$value), ]
        d$notes[1] <- 'both b'
      }
      else {
        d$notes[nrow(d)] <- 'a & b'
      }
    }
    d
  }
}

apply(main_df, 1, f, lookup_df, simplify=F) |> do.call(what=rbind)
#   start end value  notes
# 1    30  80     a       
# 2    30  80     b  a & b
# 3   124 152     b both b
# 4   161 185     b     

Data:

main_df <- structure(list(start = c(2, 30, 124, 161), end = c(1, 80, 152, 
185)), row.names = c(NA, -4L), class = "data.frame")

lookup_df <- structure(list(start = c(34, 73, 126, 141, 174, 221), end = c(69, 
123, 136, 157, 189, 267), value = c("a", "b", "b", "b", "b", 
"a")), row.names = c(NA, -6L), class = "data.frame")
猫性小仙女 2025-02-04 06:32:27

另一个选项是fuzzyjoin :: Interval_join

library(fuzzyjoin)
library(dplyr)

interval_join(main_df, lookup_df, by = c("start", "end"), mode = "inner") %>% 
  group_by(value, start.x, end.x) %>% 
  slice(1) %>% 
  select(start = start.x, end = end.x, value)

# A tibble: 4 × 3
# Groups:   value, start, end [4]
  start   end value
  <dbl> <dbl> <chr>
1    30    80 a    
2    30    80 b    
3   124   152 b    
4   161   185 b    

Another option is fuzzyjoin::interval_join:

library(fuzzyjoin)
library(dplyr)

interval_join(main_df, lookup_df, by = c("start", "end"), mode = "inner") %>% 
  group_by(value, start.x, end.x) %>% 
  slice(1) %>% 
  select(start = start.x, end = end.x, value)

# A tibble: 4 × 3
# Groups:   value, start, end [4]
  start   end value
  <dbl> <dbl> <chr>
1    30    80 a    
2    30    80 b    
3   124   152 b    
4   161   185 b    
萧瑟寒风 2025-02-04 06:32:27

您可以从data.table使用foverlaps

library(data.table)

setDT(main_df) # make it a data.table if needed
setDT(lookup_df) # make it a data.table if needed

setkey(main_df, start, end) # set the keys of 'y'

foverlaps(lookup_df, main_df, nomatch = NULL) # do the lookup

#    start end i.start i.end value
# 1:    30  80      34    69     a
# 2:    30  80      73   123     b
# 3:   124 152     126   136     b
# 4:   124 152     141   157     b
# 5:   161 185     174   189     b

或以最终结果获取清洁结果(OP的Final_df)

unique(foverlaps(lookup_df, main_df, nomatch = NULL)[, .(start, end, value)])

   start end value
1:    30  80     a
2:    30  80     b
3:   124 152     b
4:   161 185     b

You can use foverlaps from data.table for this.

library(data.table)

setDT(main_df) # make it a data.table if needed
setDT(lookup_df) # make it a data.table if needed

setkey(main_df, start, end) # set the keys of 'y'

foverlaps(lookup_df, main_df, nomatch = NULL) # do the lookup

#    start end i.start i.end value
# 1:    30  80      34    69     a
# 2:    30  80      73   123     b
# 3:   124 152     126   136     b
# 4:   124 152     141   157     b
# 5:   161 185     174   189     b

Or to get the cleaned results as end result (OP's final_df)

unique(foverlaps(lookup_df, main_df, nomatch = NULL)[, .(start, end, value)])

   start end value
1:    30  80     a
2:    30  80     b
3:   124 152     b
4:   161 185     b
烟燃烟灭 2025-02-04 06:32:27

基于powerjoin的可能解决方案:

library(tidyverse)
library(powerjoin)

power_left_join(
  main_df, lookup_df,
  by = ~ (.x$start <= .y$start & .x$end >= .y$end) |
    (.x$start >= .y$start & .x$start <= .y$end) | 
    (.x$start <= .y$start & .x$end >= .y$start), 
  keep = "left") %>% 
  distinct()

#> # A tibble: 4 x 3
#>   start   end value
#>   <dbl> <dbl> <chr>
#> 1    30    80 a    
#> 2    30    80 b    
#> 3   124   152 b    
#> 4   161   185 b

或使用tidyr :: Crossing

library(tidyverse)

crossing(main_df, lookup_df,
        .name_repair = ~ c("start", "end", "start2", "end2", "value")) %>% 
  filter((start <= start2 & end >= end2) |
         (start >= start2 & start <= end2) | (start <= start2 & end >= start2)) %>% 
  select(-start2, -end2) %>% 
  distinct()

#> # A tibble: 4 x 3
#>   start   end value
#>   <dbl> <dbl> <chr>
#> 1    30    80 a    
#> 2    30    80 b    
#> 3   124   152 b    
#> 4   161   185 b

A possible solution, based on powerjoin:

library(tidyverse)
library(powerjoin)

power_left_join(
  main_df, lookup_df,
  by = ~ (.x$start <= .y$start & .x$end >= .y$end) |
    (.x$start >= .y$start & .x$start <= .y$end) | 
    (.x$start <= .y$start & .x$end >= .y$start), 
  keep = "left") %>% 
  distinct()

#> # A tibble: 4 x 3
#>   start   end value
#>   <dbl> <dbl> <chr>
#> 1    30    80 a    
#> 2    30    80 b    
#> 3   124   152 b    
#> 4   161   185 b

Or using tidyr::crossing:

library(tidyverse)

crossing(main_df, lookup_df,
        .name_repair = ~ c("start", "end", "start2", "end2", "value")) %>% 
  filter((start <= start2 & end >= end2) |
         (start >= start2 & start <= end2) | (start <= start2 & end >= start2)) %>% 
  select(-start2, -end2) %>% 
  distinct()

#> # A tibble: 4 x 3
#>   start   end value
#>   <dbl> <dbl> <chr>
#> 1    30    80 a    
#> 2    30    80 b    
#> 3   124   152 b    
#> 4   161   185 b
草莓味的萝莉 2025-02-04 06:32:27

您可以使用fuzzyjoin软件包根据fuzzyjoin :: Interval _*_ join()函数的间隔来加入。

我会使用内部连接,因为如果您像建议的那样使用半连接,则将失去价值col,只能获得3行。

library(tidyverse)
library(fuzzyjoin)

fuzzyjoin::interval_inner_join(lookup_df, main_df, by = c("start", "end"), type = "any")
#> # A tibble: 5 × 5
#>   start.x end.x value start.y end.y
#>     <dbl> <dbl> <chr>   <dbl> <dbl>
#> 1      34    69 a          30    80
#> 2      73   123 b          30    80
#> 3     126   136 b         124   152
#> 4     141   157 b         124   152
#> 5     174   189 b         161   185

如您所见,fuzzy_inner_join()从两个表中保留cols,因为它们在模糊加入中不一样。另外,我们在main_df中仍然有单独的行,与lookup_df中的多个情况匹配。因此,我们对加入表进行了一些清理:

interval_inner_join(lookup_df, main_df, 
                    by = c("start", "end"), 
                    type = "any") |> 
  select(-ends_with(".x")) |> # remove lookup interval cols
  distinct() |> # remove duplicate
  rename_with(str_remove, ends_with(".y"), "\\.y") # remove suffixes from col names
#> # A tibble: 4 × 3
#>   value start   end
#>   <chr> <dbl> <dbl>
#> 1 a        30    80
#> 2 b        30    80
#> 3 b       124   152
#> 4 b       161   185

最后,对术语的澄清:在您的问题中,您要根据main_df falling 中加入的间隔加入间隔lookup_df。通过在Interval _*_ join()中使用type =“ in in in in in ofer”。但是根据您提供的示例,您似乎要基于任何重叠加入。可以使用type =“ Any”来完成,但这是默认值,因此您无需指定。

You can use the fuzzyjoin package to join based on intervals with the fuzzyjoin::interval_*_join() functions.

I'll be using an inner join, because if you use a semi join like you propose, you will loose the value col and get just 3 rows.

library(tidyverse)
library(fuzzyjoin)

fuzzyjoin::interval_inner_join(lookup_df, main_df, by = c("start", "end"), type = "any")
#> # A tibble: 5 × 5
#>   start.x end.x value start.y end.y
#>     <dbl> <dbl> <chr>   <dbl> <dbl>
#> 1      34    69 a          30    80
#> 2      73   123 b          30    80
#> 3     126   136 b         124   152
#> 4     141   157 b         124   152
#> 5     174   189 b         161   185

As you can see, the fuzzy_inner_join() preserves the by cols from both tables, since they are not the same in a fuzzy join. Also, we still have separate rows for those cases in main_df that match multiple cases in lookup_df. Thus, we do some cleanup of the joined table:

interval_inner_join(lookup_df, main_df, 
                    by = c("start", "end"), 
                    type = "any") |> 
  select(-ends_with(".x")) |> # remove lookup interval cols
  distinct() |> # remove duplicate
  rename_with(str_remove, ends_with(".y"), "\\.y") # remove suffixes from col names
#> # A tibble: 4 × 3
#>   value start   end
#>   <chr> <dbl> <dbl>
#> 1 a        30    80
#> 2 b        30    80
#> 3 b       124   152
#> 4 b       161   185

Finally, a clarification of terminology: In your question you state you want to join based on the interval from main_df falling within the interval from lookup_df. This is possible by using type = "within" in interval_*_join(). But based on the examples you provide, it appears you want to join based on any overlap. This can be done with type = "any", but it is the default, so you don't need to specify it.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文