在 R (dplyr) 中计算重叠日期
我有一个个人数据集(CSN),每个人在入院期间都接受过从零到多次的干预(在本例中,放置了中心线),每个干预都有开始和结束日期。我正在尝试生成一个新的日期范围来计算任何重叠的日期。换句话说,我试图计算一个人有一条中心线时的总日期范围。
例如数据:
structure(list(CSN_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), First_day = structure(c(1514937600,
1514937600, 1515024000, 1515024000, 1515110400, 1515974400, 1516147200,
1516147200, 1516147200, 1516233600, 1516233600, 1517097600, 1517097600,
1517702400, 1517356800, 1518220800, 1519257600, 1519948800, 1520812800,
1521504000, 1522022400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), Last_day = structure(c(1515628800, 1515110400, 1515542400,
1515542400, 1515628800, 1516579200, 1516320000, 1517184000, 1516233600,
1517184000, 1517702400, 1517184000, 1517616000, 1517702400, 1518220800,
1518825600, 1519689600, 1520812800, 1521763200, 1522108800, 1522108800
), tzone = "UTC", class = c("POSIXct", "POSIXt"))), row.names = c(NA,
-21L), class = c("tbl_df", "tbl", "data.frame"))
理想情况下,输出将为所有重叠日期返回单个日期范围,但如果每个日期都错过了一段日期,则将创建一个新的间隔。因此,对于第 1 组,第 1-5 行的 start = 2018-01-03 和 end = 2018-01-11,但第 6 行的 start = 2018-01-15 和 end = 2018-01-22 。
我尝试执行以下操作:
df %>%
arrange(CSN_id, First_day) %>%
mutate(First_day = ymd(First_day),
Last_day = ymd(Last_day),
start = ymd("1970-01-01"),
end = ymd("1970-01-01")) %>%
group_by(CSN_id) %>%
rowwise() %>%
mutate(test = if_else(row_number() == 1, interval(First_day, Last_day), interval(lag(start), lag(end))),
start = if_else(row_number() == 1, First_day,
if_else(First_day <= lag(end), lag(First_day), First_day)),
end = if_else(row_number() == 1, Last_day,
if_else(Last_day %within% lag(test) == TRUE, lag(end), Last_day)))
但是,我认为滞后函数没有按预期工作,并且由于某种原因它总是返回 Last_day。我尝试摆脱 rowwise,但随后间隔变得混乱(一直停留在 1970 年代)。
我得到的输出是:
structure(list(CSN_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), First_day = structure(c(17534,
17534, 17535, 17535, 17536, 17546, 17548, 17548, 17548, 17549,
17549, 17559, 17559, 17566, 17562, 17572, 17584, 17592, 17602,
17610, 17616), class = "Date"), Last_day = structure(c(17542,
17536, 17541, 17541, 17542, 17553, 17550, 17560, 17549, 17560,
17566, 17560, 17565, 17566, 17572, 17579, 17589, 17602, 17613,
17617, 17617), class = "Date"), start = structure(c(17534, 17534,
17535, 17535, 17536, 17546, 17548, 17548, 17548, 17549, 17549,
17559, 17559, 17566, 17562, 17572, 17584, 17592, 17602, 17610,
17616), class = "Date"), end = structure(c(17542, 17536, 17541,
17541, 17542, 17553, 17550, 17560, 17549, 17560, 17566, 17560,
17565, 17566, 17572, 17579, 17589, 17602, 17613, 17617, 17617
), class = "Date"), test = new("Interval", .Data = c(691200,
172800, 518400, 518400, 518400, 604800, 172800, 1036800, 86400,
950400, 1468800, 86400, 518400, 0, 864000, 604800, 432000, 864000,
950400, 604800, 86400), start = structure(c(1514937600, 1514937600,
1515024000, 1515024000, 1515110400, 1515974400, 1516147200, 1516147200,
1516147200, 1516233600, 1516233600, 1517097600, 1517097600, 1517702400,
1517356800, 1518220800, 1519257600, 1519948800, 1520812800, 1521504000,
1522022400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
tzone = "UTC")), class = c("rowwise_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -21L), groups = structure(list(
CSN_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .rows = structure(list(
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L,
14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -21L), class = c("tbl_df",
"tbl", "data.frame")))
我是否遗漏了一些明显的东西?任何帮助将不胜感激!
I have a dataset of individuals (CSN), each of whom has had anywhere from zero to multiple interventions performed during a hospital admission (in this case, central lines placed), each with a start and an end date. I am trying to generate a new date range that will calculate any overlapping dates. In other words, I'm trying to calculate the total date range when an individual had a central line in place.
Data for example:
structure(list(CSN_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), First_day = structure(c(1514937600,
1514937600, 1515024000, 1515024000, 1515110400, 1515974400, 1516147200,
1516147200, 1516147200, 1516233600, 1516233600, 1517097600, 1517097600,
1517702400, 1517356800, 1518220800, 1519257600, 1519948800, 1520812800,
1521504000, 1522022400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), Last_day = structure(c(1515628800, 1515110400, 1515542400,
1515542400, 1515628800, 1516579200, 1516320000, 1517184000, 1516233600,
1517184000, 1517702400, 1517184000, 1517616000, 1517702400, 1518220800,
1518825600, 1519689600, 1520812800, 1521763200, 1522108800, 1522108800
), tzone = "UTC", class = c("POSIXct", "POSIXt"))), row.names = c(NA,
-21L), class = c("tbl_df", "tbl", "data.frame"))
Ideally, the output would return a single date range for all overlapping dates, but if there were a stretch of days that are missed by each, then a new interval would be created. So, for group 1, rows 1-5 would all have start = 2018-01-03 and end = 2018-01-11, but then row 6 would have start = 2018-01-15 and end = 2018-01-22.
I've tried to do the following:
df %>%
arrange(CSN_id, First_day) %>%
mutate(First_day = ymd(First_day),
Last_day = ymd(Last_day),
start = ymd("1970-01-01"),
end = ymd("1970-01-01")) %>%
group_by(CSN_id) %>%
rowwise() %>%
mutate(test = if_else(row_number() == 1, interval(First_day, Last_day), interval(lag(start), lag(end))),
start = if_else(row_number() == 1, First_day,
if_else(First_day <= lag(end), lag(First_day), First_day)),
end = if_else(row_number() == 1, Last_day,
if_else(Last_day %within% lag(test) == TRUE, lag(end), Last_day)))
However, I don't think the lag function is working as intended, and it always returns Last_day for some reason. I tried getting rid of rowwise, but then the intervals get messed up (persistently stuck in 1970s).
The output I'm getting is:
structure(list(CSN_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), First_day = structure(c(17534,
17534, 17535, 17535, 17536, 17546, 17548, 17548, 17548, 17549,
17549, 17559, 17559, 17566, 17562, 17572, 17584, 17592, 17602,
17610, 17616), class = "Date"), Last_day = structure(c(17542,
17536, 17541, 17541, 17542, 17553, 17550, 17560, 17549, 17560,
17566, 17560, 17565, 17566, 17572, 17579, 17589, 17602, 17613,
17617, 17617), class = "Date"), start = structure(c(17534, 17534,
17535, 17535, 17536, 17546, 17548, 17548, 17548, 17549, 17549,
17559, 17559, 17566, 17562, 17572, 17584, 17592, 17602, 17610,
17616), class = "Date"), end = structure(c(17542, 17536, 17541,
17541, 17542, 17553, 17550, 17560, 17549, 17560, 17566, 17560,
17565, 17566, 17572, 17579, 17589, 17602, 17613, 17617, 17617
), class = "Date"), test = new("Interval", .Data = c(691200,
172800, 518400, 518400, 518400, 604800, 172800, 1036800, 86400,
950400, 1468800, 86400, 518400, 0, 864000, 604800, 432000, 864000,
950400, 604800, 86400), start = structure(c(1514937600, 1514937600,
1515024000, 1515024000, 1515110400, 1515974400, 1516147200, 1516147200,
1516147200, 1516233600, 1516233600, 1517097600, 1517097600, 1517702400,
1517356800, 1518220800, 1519257600, 1519948800, 1520812800, 1521504000,
1522022400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
tzone = "UTC")), class = c("rowwise_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -21L), groups = structure(list(
CSN_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .rows = structure(list(
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L,
14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -21L), class = c("tbl_df",
"tbl", "data.frame")))
Is there something obvious I'm missing? Any help would be much appreciated!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
我不太确定您想要的输出是什么,但您可以尝试以下方法:
输出:
如果您希望保留所有原始行,并且所有日期都是日期而不是日期时间,您也可以执行以下操作:
输出:
I'm not exactly sure what your desired output is, but you can try this approach:
Output:
If you prefer to retain all the original rows, and all the dates are dates and not datetimes, you could also do something like this:
Output:
这是使用 Bioconductor 上的
IRanges
包的另一个选项< /a>.collapse_date_ranges
函数取自此处,我只是根据输出进行了调整
在原始数据框中,那么我们可以将数据连接回原始数据框中,然后使用
fill
将日期添加到每行。输出
Here is another option using the
IRanges
package on Bioconductor. Thecollapse_date_ranges
function is taken from here, and I just adjusted accordingOutput
If you want to have this in the original dataframe, then we can join the data back to the original dataframe, then use
fill
to add the dates to each row.Output