加入从R中的两个数据帧重叠范围
注意:这个问题是“重复”的。提供的解决方案在这里没有回答我的问题。他们展示了如何在单个条目落在一个范围内的情况下合并,我正在尝试识别重叠的范围并加入它们。也许我的标题本来可以更好...
我有一个主数据集main_df
,带有开始和结束时间(以秒为单位)。我想看看main_df
中的时间范围是否属于lookup_df
中的范围列表,如果是这样,请从main_df
属于两个不同的查找范围,请重复该行,以表示每个值。***
main_df <- tibble(start = c(30,124,161),
end = c(80,152,185))
lookup_df <- tibble(start = c(34,73,126,141,174,221),
end = c(69,123,136,157,189,267),
value = c('a','b','b','b','b','a'))
# Do something here to get the following:
> final_df
# A tibble: 4 x 4
start end value notes
<dbl> <dbl> <chr> <chr>
1 30 80 a ""
2 30 80 b "Duplicate because it falls within a and b"
3 124 152 b "Falls within two lookups but both are b"
4 161 185 b ""
方式...
#Not actual code
left_join(main_df, lookup_df, by(some_range_join_function) %>%
add_rows(through_some_means)
***编辑:查看我结构问题的 除了要添加新行,我可以翻转如何加入它们...
semi_join(lookup_df, main_df, by(some_range_join_function))
Note: This question was closed as a 'duplicate'. The solutions offered here and here did not answer my question. They showed how to merge when a single entry fell within a range, I'm trying to identify overlapping ranges and joining them. Perhaps my title could have been better...
I have a main data set main_df
with a start and end time (in seconds). I would like to see if the time range in main_df
falls within a list of ranges in lookup_df
, and if so, grab the value from lookup_df
. Additionally, if the main_df
falls within two different lookup ranges, duplicate the row so each value is represented.***
main_df <- tibble(start = c(30,124,161),
end = c(80,152,185))
lookup_df <- tibble(start = c(34,73,126,141,174,221),
end = c(69,123,136,157,189,267),
value = c('a','b','b','b','b','a'))
# Do something here to get the following:
> final_df
# A tibble: 4 x 4
start end value notes
<dbl> <dbl> <chr> <chr>
1 30 80 a ""
2 30 80 b "Duplicate because it falls within a and b"
3 124 152 b "Falls within two lookups but both are b"
4 161 185 b ""
***Edit: Looking at the way I've structured the problem...
#Not actual code
left_join(main_df, lookup_df, by(some_range_join_function) %>%
add_rows(through_some_means)
Rather than having to add a new row I could flip how I'm joining them...
semi_join(lookup_df, main_df, by(some_range_join_function))
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
您可以进行一些合乎逻辑的比较,然后处理一个情况,如果所有情况都是
'b'
,'a'
和'b'
,等等。通过这种方式,您很容易添加更多情况,例如'a'
,一个是'a'
,更多是'b'
您没有在OP中声明。该方法产生null
,如果没有匹配的rbind
中的匹配。数据:
You could do some logical comparisons and then a case handling what shall happen if all are
'b'
,'a'
and'b'
, etc. In this way you easily could add more cases, e.g. both are'a'
, one is'a'
, more are'b'
which you didn't declare in OP. The approach yieldsNULL
if there are no matches which gets omitted duringrbind
.Data:
另一个选项是
fuzzyjoin :: Interval_join
:Another option is
fuzzyjoin::interval_join
:您可以从
data.table
使用foverlaps
。或以最终结果获取清洁结果(OP的Final_df)
You can use
foverlaps
fromdata.table
for this.Or to get the cleaned results as end result (OP's final_df)
基于
powerjoin
的可能解决方案:或使用
tidyr :: Crossing
:A possible solution, based on
powerjoin
:Or using
tidyr::crossing
:您可以使用
fuzzyjoin
软件包根据fuzzyjoin :: Interval _*_ join()
函数的间隔来加入。我会使用内部连接,因为如果您像建议的那样使用半连接,则将失去价值col,只能获得3行。
如您所见,
fuzzy_inner_join()
从两个表中保留cols,因为它们在模糊加入中不一样。另外,我们在main_df
中仍然有单独的行,与lookup_df
中的多个情况匹配。因此,我们对加入表进行了一些清理:最后,对术语的澄清:在您的问题中,您要根据
main_df
falling 在中加入的间隔加入间隔lookup_df
。通过在Interval _*_ join()
中使用type =“ in in in in in ofer”
。但是根据您提供的示例,您似乎要基于任何重叠加入。可以使用type =“ Any”
来完成,但这是默认值,因此您无需指定。You can use the
fuzzyjoin
package to join based on intervals with thefuzzyjoin::interval_*_join()
functions.I'll be using an inner join, because if you use a semi join like you propose, you will loose the value col and get just 3 rows.
As you can see, the
fuzzy_inner_join()
preserves the by cols from both tables, since they are not the same in a fuzzy join. Also, we still have separate rows for those cases inmain_df
that match multiple cases inlookup_df
. Thus, we do some cleanup of the joined table:Finally, a clarification of terminology: In your question you state you want to join based on the interval from
main_df
falling within the interval fromlookup_df
. This is possible by usingtype = "within"
ininterval_*_join()
. But based on the examples you provide, it appears you want to join based on any overlap. This can be done withtype = "any"
, but it is the default, so you don't need to specify it.