Flag continuous observations and create enrolment spans

Posted on 2025-01-28 18:59:04

I have a few large enrolment datasets and I'm trying to create two things:

I'd like to flag each uninterrupted monthly observation (final_df1)
I'd like to create a dataset of uninterrupted spans (final_df2)

For example:

library(tidyverse)
library(lubridate)
library(magrittr)
df<-tibble(id=c(rep("X",10),rep("Y",20)),
           date=c(ymd("20120101")%m+%months(c(1:5,7:11)),ymd("20120401")%m+%months(c(1:10,12:17,19:22))))

final_df1 <- df %>% mutate(cont_enroll=c(rep(1,5),rep(0,5),rep(1,10),rep(0,10)))

final_df2 <- tibble(id=c(rep("X",2),rep("Y",3)),
                  span_start=c(ymd("20120101")%m+%months(1),
                               ymd("20120101")%m+%months(7),
                               ymd("20120401")%m+%months(1),
                               ymd("20120101")%m+%months(12),
                               ymd("20120101")%m+%months(19)),
                  span_end=c(ymd("20120101")%m+%months(5),
                             ymd("20120101")%m+%months(11),
                             ymd("20120101")%m+%months(10),
                             ymd("20120101")%m+%months(17),
                             ymd("20120101")%m+%months(22))
                  )

I feel like there must be a simple way to do this between {lubridate} and {data.table}, but I'm drawing a blank. Any tips?

猫七 2025-02-04 18:59:04

Group by 'id', build an interval from the previous value of 'date' (lag) to the current 'date', divide it by months(1), check whether the result is less than 2, and take the cumulative minimum (cummin) so that the flag stays at 0 once the first gap has been seen. After creating 'final_df_new', group by 'id' and the run-length id of the 'cont_enroll2' column, and summarise with the first and last values of 'date' to create 'span_start' and 'span_end' respectively.

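library(dplyr)
library(lubridate)
library(data.table)  # for rleid()

# Flag rows up to the first gap: the months between consecutive dates must
# stay below 2, and cummin() keeps the flag at 0 once a gap has been seen.
final_df_new <- df %>%
   group_by(id)  %>%
   mutate(cont_enroll2 = cummin(interval(lag(date, default = first(date)), 
       date) /months(1) < 2))  %>%
   ungroup()

# One row per run of cont_enroll2 within each id.
final_df_new %>% 
  group_by(id, grp = rleid(cont_enroll2))  %>%
  summarise(span_start = first(date), span_end = last(date), .groups = 'drop')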
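
One thing to watch with the run-length grouping above: because cont_enroll2 comes from cummin(), it can only switch from 1 to 0 once per id, so everything after the first gap lands in a single group and Y's two later spans come out as one. Here is a minimal sketch of an alternative, assuming every date is the first of a month: number the months, start a new span wherever the step from the previous row is not exactly one month, and summarise per span. The names month_idx, gap, span_id, flagged and spans are made up for this illustration and are not part of the answer.

```r
library(dplyr)
library(lubridate)

# Same example data as in the question.
df <- tibble(
  id = c(rep("X", 10), rep("Y", 20)),
  date = c(
    ymd("20120101") %m+% months(c(1:5, 7:11)),
    ymd("20120401") %m+% months(c(1:10, 12:17, 19:22))
  )
)

flagged <- df %>%
  arrange(id, date) %>%
  group_by(id) %>%
  mutate(
    # Running month number, so consecutive months differ by exactly 1.
    month_idx = year(date) * 12 + month(date),
    # Step from the previous observation (NA on the first row of each id).
    gap = month_idx - lag(month_idx),
    # A new span starts on the first row and after every step that is not 1.
    span_id = cumsum(is.na(gap) | gap != 1),
    # Only the first span of each id counts as continuously enrolled.
    cont_enroll = as.integer(span_id == 1L)
  ) %>%
  ungroup()

# One row per uninterrupted span, with inclusive start and end months.
spans <- flagged %>%
  group_by(id, span_id) %>%
  summarise(span_start = min(date), span_end = max(date), .groups = "drop")
```

Here flagged plays the role of final_df1 and spans the role of final_df2. On the example data this should give two spans for X and three for Y.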


淤浪 2025-02-04 18:59:04

I think you can solve this nicely with the ivs package. Your dates seem to really represent 1 month intervals, and the ivs package is dedicated to working with data of this type.

We can compute final_df2 with iv_groups(), which returns the non-overlapping intervals that remain after merging all overlapping intervals.

Then the first row of final_df2 per group represents the first continuous interval, so you just need to check if each range is within that interval or not to decide if it is part of the uninterrupted set to get final_df1.

Note that my final_df2 looks different from yours. Is it possible that you have an error in how you coded it?

library(dplyr)
library(lubridate)
library(ivs)

df <- tibble(
  id = c(
    rep("X", 10), 
    rep("Y", 20)
  ),
  date = c(
    ymd("20120101") %m+% months(c(1:5,7:11)), 
    ymd("20120401") %m+% months(c(1:10,12:17,19:22))
  )
)
df
#> # A tibble: 30 × 2
#>    id    date      
#>    <chr> <date>    
#>  1 X     2012-02-01
#>  2 X     2012-03-01
#>  3 X     2012-04-01
#>  4 X     2012-05-01
#>  5 X     2012-06-01
#>  6 X     2012-08-01
#>  7 X     2012-09-01
#>  8 X     2012-10-01
#>  9 X     2012-11-01
#> 10 X     2012-12-01
#> # … with 20 more rows

df <- df %>%
  mutate(start = date, end = date + months(1), .keep = "unused") %>%
  mutate(range = iv(start, end), .keep = "unused")

df
#> # A tibble: 30 × 2
#>    id                       range
#>    <chr>               <iv<date>>
#>  1 X     [2012-02-01, 2012-03-01)
#>  2 X     [2012-03-01, 2012-04-01)
#>  3 X     [2012-04-01, 2012-05-01)
#>  4 X     [2012-05-01, 2012-06-01)
#>  5 X     [2012-06-01, 2012-07-01)
#>  6 X     [2012-08-01, 2012-09-01)
#>  7 X     [2012-09-01, 2012-10-01)
#>  8 X     [2012-10-01, 2012-11-01)
#>  9 X     [2012-11-01, 2012-12-01)
#> 10 X     [2012-12-01, 2013-01-01)
#> # … with 20 more rows

# `iv_groups()` returns the groups that remain after merging all overlapping ranges.
# It gives you `final_df2`.
continuous <- df %>%
  group_by(id) %>%
  summarise(range = iv_groups(range), .groups = "drop")

continuous
#> # A tibble: 5 × 2
#>   id                       range
#>   <chr>               <iv<date>>
#> 1 X     [2012-02-01, 2012-07-01)
#> 2 X     [2012-08-01, 2013-01-01)
#> 3 Y     [2012-05-01, 2013-03-01)
#> 4 Y     [2013-04-01, 2013-10-01)
#> 5 Y     [2013-11-01, 2014-03-01)

# The first continuous range per id
first_continuous <- continuous %>%
  group_by(id) %>%
  slice(1) %>%
  ungroup() %>%
  rename(range_continuous = range)

first_continuous
#> # A tibble: 2 × 2
#>   id            range_continuous
#>   <chr>               <iv<date>>
#> 1 X     [2012-02-01, 2012-07-01)
#> 2 Y     [2012-05-01, 2013-03-01)

# Join the first continuous range df back onto the original df and see if
# the current `range` falls within the first continuous range or not.
# This gives you `final_df1`.
left_join(df, first_continuous, by = "id") %>%
  mutate(continuous = iv_pairwise_overlaps(range, range_continuous, type = "within"))
#> # A tibble: 30 × 4
#>    id                       range         range_continuous continuous
#>    <chr>               <iv<date>>               <iv<date>> <lgl>     
#>  1 X     [2012-02-01, 2012-03-01) [2012-02-01, 2012-07-01) TRUE      
#>  2 X     [2012-03-01, 2012-04-01) [2012-02-01, 2012-07-01) TRUE      
#>  3 X     [2012-04-01, 2012-05-01) [2012-02-01, 2012-07-01) TRUE      
#>  4 X     [2012-05-01, 2012-06-01) [2012-02-01, 2012-07-01) TRUE      
#>  5 X     [2012-06-01, 2012-07-01) [2012-02-01, 2012-07-01) TRUE      
#>  6 X     [2012-08-01, 2012-09-01) [2012-02-01, 2012-07-01) FALSE     
#>  7 X     [2012-09-01, 2012-10-01) [2012-02-01, 2012-07-01) FALSE     
#>  8 X     [2012-10-01, 2012-11-01) [2012-02-01, 2012-07-01) FALSE     
#>  9 X     [2012-11-01, 2012-12-01) [2012-02-01, 2012-07-01) FALSE     
#> 10 X     [2012-12-01, 2013-01-01) [2012-02-01, 2012-07-01) FALSE     
#> # … with 20 more rows

Created on 2022-05-13 by the reprex package (v2.0.1)
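
If you want the result in the span_start/span_end layout from the question, the half-open ranges can be unpacked again with iv_start() and iv_end(). This is a sketch, assuming span_end should be the last observed month rather than the exclusive upper bound of the range; that one-month shift is my assumption about the desired layout. The continuous tibble is retyped by hand from the output above so the snippet runs on its own.

```r
library(dplyr)
library(lubridate)
library(ivs)

# The merged ranges from the output above, rebuilt by hand.
continuous <- tibble(
  id = c("X", "X", "Y", "Y", "Y"),
  range = iv(
    ymd(c("2012-02-01", "2012-08-01", "2012-05-01", "2013-04-01", "2013-11-01")),
    ymd(c("2012-07-01", "2013-01-01", "2013-03-01", "2013-10-01", "2014-03-01"))
  )
)

final_df2 <- continuous %>%
  mutate(
    # Inclusive lower bound of each range.
    span_start = iv_start(range),
    # The upper bound is exclusive, so step back one month to get
    # the last month that was actually observed.
    span_end = iv_end(range) %m-% months(1),
    .keep = "unused"
  )
```

With .keep = "unused" the range column is dropped, leaving id, span_start and span_end.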
