Merge two data frames on non-overlapping intervals
I want to merge two data frames.
Both of them have a begin date and an end date.
If the given intervals overlap, I want to split the resulting rows into non-overlapping intervals.
Please see this example:
a
id  beg_a       end_a       prop_a
1   2000-01-01  2002-12-31  A
2   2000-01-01  2000-02-15  B
2   2000-04-01  2000-04-15  A
2   2002-01-01  2002-12-31  B
3   2000-01-01  2000-06-15  A
b
id  beg_b       end_b       prop_b
1   1999-06-01  2000-05-15  D
1   2003-01-15  2003-01-31  D
2   1999-01-01  2003-01-15  D
3   2000-07-01  2001-08-01  E
merged
id  beg_a       end_a       prop_a  beg_b       end_b       prop_b  overallBeg  overallEnd
1   <NA>        <NA>        <NA>    1999-06-01  2000-05-15  D       1999-06-01  1999-12-31
1   2000-01-01  2002-12-31  A       1999-06-01  2000-05-15  D       2000-01-01  2000-05-15
1   2000-01-01  2002-12-31  A       <NA>        <NA>        <NA>    2000-05-16  2002-12-31
1   <NA>        <NA>        <NA>    2003-01-15  2003-01-31  D       2003-01-15  2003-01-31
2   <NA>        <NA>        <NA>    1999-01-01  2003-01-15  D       1999-01-01  1999-12-31
2   2000-01-01  2000-02-15  B       1999-01-01  2003-01-15  D       2000-01-01  2000-02-15
2   <NA>        <NA>        <NA>    1999-01-01  2003-01-15  D       2000-02-16  2000-03-31
2   2000-04-01  2000-04-15  A       1999-01-01  2003-01-15  D       2000-04-01  2000-04-15
2   <NA>        <NA>        <NA>    1999-01-01  2003-01-15  D       2000-04-16  2001-12-31
2   2002-01-01  2002-12-31  B       1999-01-01  2003-01-15  D       2002-01-01  2002-12-31
2   <NA>        <NA>        <NA>    1999-01-01  2003-01-15  D       2003-01-01  2003-01-15
3   2000-01-01  2000-06-15  A       <NA>        <NA>        <NA>    2000-01-01  2000-06-15
3   <NA>        <NA>        <NA>    2000-07-01  2001-08-01  E       2000-07-01  2001-08-01
(or simply use these commands in R)
a <- structure(list(id = c(1, 2, 2, 2, 3), beg_a = structure(c(10957,
10957, 11048, 11688, 10957), class = "Date"), end_a = structure(c(12052,
11002, 11062, 12052, 11123), class = "Date"), prop_a = structure(c(1L,
2L, 1L, 2L, 1L), .Label = c("A", "B"), class = "factor")), .Names = c("id",
"beg_a", "end_a", "prop_a"), row.names = c(NA, -5L), class = "data.frame")
b <- structure(list(id = c(1, 1, 2, 3), beg_b = structure(c(10743,
12067, 10592, 11139), class = "Date"), end_b = structure(c(11092,
12083, 12067, 11535), class = "Date"), prop_b = structure(c(1L,
1L, 1L, 2L), .Label = c("D", "E"), class = "factor")), .Names = c("id",
"beg_b", "end_b", "prop_b"), row.names = c(NA, -4L), class = "data.frame")
merged <- structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3),
beg_a = structure(c(NA, 10957, 10957, NA, NA, 10957, NA,
11048, NA, 11688, NA, 10957, NA), class = "Date"), end_a = structure(c(NA,
12052, 12052, NA, NA, 11002, NA, 11062, NA, 12052, NA, 11123,
NA), class = "Date"), prop_a = structure(c(NA, 1L, 1L, NA,
NA, 2L, NA, 1L, NA, 2L, NA, 1L, NA), .Label = c("A", "B"), class = "factor"),
beg_b = structure(c(10743, 10743, NA, 12067, 10592, 10592,
10592, 10592, 10592, 10592, 10592, NA, 11139), class = "Date"),
end_b = structure(c(11092, 11092, NA, 12083, 12067, 12067,
12067, 12067, 12067, 12067, 12067, NA, 11535), class = "Date"),
prop_b = structure(c(1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, NA, 2L), .Label = c("D", "E"), class = "factor"),
overallBeg = structure(c(10743, 10957, 11093, 12067, 10592,
10957, 11003, 11048, 11063, 11688, 12053, 10957, 11139), class = "Date"),
overallEnd = structure(c(10956, 11092, 12052, 12083, 10956,
11002, 11047, 11062, 11687, 12052, 12067, 11123, 11535), class = "Date")), .Names = c("id",
"beg_a", "end_a", "prop_a", "beg_b", "end_b", "prop_b", "overallBeg",
"overallEnd"), row.names = c(NA, -13L), class = "data.frame")
I think there are some similarities with another question of mine:
"smoothing" time data - can it be done more efficient?
But this problem is also slightly different.
Thank you in advance for your help!
Comments (2)
深陷 2025-01-03 04:46:12
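You can do it in two steps:
first, compute all the desired intervals and put them in an intermediate table; then join this table with the two initial data frames.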
# First build all the desired intervals.
# Rename the date columns so both tables share the same column names.
names(a) <- c("id", "valid_from", "valid_until", "prop_a")
names(b) <- c("id", "valid_from", "valid_until", "prop_b")

# Collect every begin and end date per id as a candidate cut point
intervals <- rbind(
  data.frame(id = a$id, date = a$valid_from),
  data.frame(id = a$id, date = a$valid_until),
  data.frame(id = b$id, date = b$valid_from),
  data.frame(id = b$id, date = b$valid_until)
)
intervals <- unique(intervals)
intervals <- intervals[order(intervals$id, intervals$date), ]

# Pair each cut point with the next one
n <- nrow(intervals)
intervals <- data.frame(
  id          = intervals$id[-n],
  id2         = intervals$id[-1],
  valid_from  = intervals$date[-n],
  valid_until = intervals$date[-1]
)

# Keep only the pairs whose two cut points belong to the same id
intervals <- intervals[
  intervals$id == intervals$id2,
  c("id", "valid_from", "valid_until")
]
Since the condition on which we join the data is not a simple equality, we use sqldf.
library(sqldf)
d <- sqldf("
  SELECT intervals.id,
         intervals.valid_from, intervals.valid_until,
         a.prop_a, b.prop_b
  FROM intervals
  LEFT JOIN a
    ON a.valid_from <= intervals.valid_from
   AND intervals.valid_until <= a.valid_until
   AND intervals.id = a.id
  LEFT JOIN b
    ON b.valid_from <= intervals.valid_from
   AND intervals.valid_until <= b.valid_until
   AND intervals.id = b.id
")
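Note that the intervals built above share their end points (one interval ends on the date the next one begins) and also include gaps covered by neither table. If the intervals are meant to be closed on both ends, as in the question's `merged` example, one possible adjustment is to take the cut points from every begin date and from every end date plus one day. Below is a minimal base-R sketch for a single id; `closed_pieces` is a hypothetical helper name and is not part of the answer above.

```r
# Hypothetical helper: split closed date intervals [beg, end] into
# non-overlapping closed pieces. Cut points are every begin date and
# every (end date + 1 day).
closed_pieces <- function(beg, end) {
  cuts <- sort(unique(c(beg, end + 1)))
  data.frame(
    from = cuts[-length(cuts)],
    to   = cuts[-1] - 1
  )
}

# Intervals of id 1 from both tables in the question
beg <- as.Date(c("2000-01-01", "1999-06-01", "2003-01-15"))
end <- as.Date(c("2002-12-31", "2000-05-15", "2003-01-31"))
pieces <- closed_pieces(beg, end)

# Keep only the pieces covered by at least one input interval
covered <- sapply(seq_len(nrow(pieces)), function(i) {
  any(beg <= pieces$from[i] & pieces$to[i] <= end)
})
pieces[covered, ]
```

For id 1 this leaves four pieces whose bounds agree with the `overallBeg` and `overallEnd` columns of the question's example; the uncovered piece between 2003-01-01 and 2003-01-14 is dropped.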
sqldf will work, but I tried a 'pure' R solution. It works, but it is a little sloppy. I haven't figured out how to 'vectorize' the solution (remove the two for loops in split.interval, and remove the need to lapply over id.split).
First I create two functions that can take one id, and merge 'a' and 'b' together:
Then I run the function for every ID, and bind them all together.
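The functions mentioned in this answer are not reproduced here. As a rough illustration of the same per-id idea, here is a base-R sketch that handles one id at a time and then binds the results. It is not the answerer's original code: `merge_one_id` and `cover` are hypothetical names, the intervals are assumed to be closed on both ends, and intervals within one table are assumed not to overlap each other for the same id.

```r
# Example input from the question
a <- data.frame(
  id     = c(1, 2, 2, 2, 3),
  beg_a  = as.Date(c("2000-01-01", "2000-01-01", "2000-04-01",
                     "2002-01-01", "2000-01-01")),
  end_a  = as.Date(c("2002-12-31", "2000-02-15", "2000-04-15",
                     "2002-12-31", "2000-06-15")),
  prop_a = c("A", "B", "A", "B", "A")
)
b <- data.frame(
  id     = c(1, 1, 2, 3),
  beg_b  = as.Date(c("1999-06-01", "2003-01-15", "1999-01-01", "2000-07-01")),
  end_b  = as.Date(c("2000-05-15", "2003-01-31", "2003-01-15", "2001-08-01")),
  prop_b = c("D", "D", "D", "E")
)

# Hypothetical helper: merge the rows of one id from `a` and `b`
merge_one_id <- function(id, a, b) {
  a1 <- a[a$id == id, ]
  b1 <- b[b$id == id, ]
  # Cut points: every begin date and every (end date + 1 day)
  cuts <- sort(unique(c(a1$beg_a, a1$end_a + 1, b1$beg_b, b1$end_b + 1)))
  from <- cuts[-length(cuts)]
  to   <- cuts[-1] - 1
  # Row index of the input interval covering each piece (NA if none)
  cover <- function(beg, end) {
    vapply(seq_along(from), function(i) {
      which(beg <= from[i] & to[i] <= end)[1]
    }, integer(1))
  }
  ia <- cover(a1$beg_a, a1$end_a)
  ib <- cover(b1$beg_b, b1$end_b)
  out <- data.frame(
    id = id,
    beg_a = a1$beg_a[ia], end_a = a1$end_a[ia], prop_a = a1$prop_a[ia],
    beg_b = b1$beg_b[ib], end_b = b1$end_b[ib], prop_b = b1$prop_b[ib],
    overallBeg = from, overallEnd = to
  )
  # Drop gaps that are covered by neither table
  out[!(is.na(ia) & is.na(ib)), ]
}

# Run the helper for every id and bind the pieces together
ids <- sort(unique(c(a$id, b$id)))
result <- do.call(rbind, lapply(ids, merge_one_id, a = a, b = b))
rownames(result) <- NULL
result
```

The `lapply` over ids is kept on purpose to mirror the approach described above; a fully vectorized version would need a different strategy, such as a non-equi join.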