小组按日期差异

发布于 2025-01-19 09:39:11 字数 1378 浏览 1 评论 0原文

我想按日期差异进行分组。

例如，如果设施 A 有 7 个病例，但前 5 个病例发生在最后 2 个病例的 14 天之前，我希望它们位于两个不同的组中（参见下面的示例）

位置	地址	start_date	start_date_diff	Group
Facility A	123 main 2/7/2022	1	0	1
设施 A	主街	2/11/2022	4	123
设施 A	123 主街	2/11/2022	0	1
设施 A	main st	2/11/2022	0	1
设施 A	123	123 main st 2/12/2022	1	1
设施 A	123 main st	3/12/2022	28	2
设施 A	123 main st	3/ 17/2022	5	2
设施 B	55福特路	3/16/2022	0	3
设施 B	55 福特路	3/16/2022	0	3
设施 C	1 阶梯大道	3/16/2022	0	4
设施 C	1 阶梯大道	3/20/2022	4	4
设施 C	1 阶梯大道	3/22/2022	2	4

这是我的代码，所以到目前为止：

我被困在如何通过个体观察之间的日期差异进一步对它们进行分组。

原文

I want to group within a group by date difference.

For example, if there are 7 cases in facility A, but the first 5 cases happened before 14 days of the last 2 cases, I want them to be in two different groups (see below example)

location	address	start_date	start_date_diff	Group
Facility A	123 main st	2/7/2022	0	1
Facility A	123 main st	2/11/2022	4	1
Facility A	123 main st	2/11/2022	0	1
Facility A	123 main st	2/11/2022	0	1
Facility A	123 main st	2/12/2022	1	1
Facility A	123 main st	3/12/2022	28	2
Facility A	123 main st	3/17/2022	5	2
Facility B	55 ford rd	3/16/2022	0	3
Facility B	55 ford rd	3/16/2022	0	3
Facility C	1 step ave	3/16/2022	0	4
Facility C	1 step ave	3/20/2022	4	4
Facility C	1 step ave	3/22/2022	2	4

here is my code so far:

I am stuck on how to group them further by the date difference between individual observations.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

离鸿 2025-01-26 09:39:11

假设我们还没有计算 diff，并且我们需要将 start_date 转换为算术上有用的东西。

data.table

library(data.table)
as.data.table(dat)[, start_date := as.Date(start_date, format = "%m/%d/%Y")
  ][, diff14 := cumsum(c(0, diff(start_date)) > 14), by = location
  ][, Group2 := rleid(location, diff14)][]
#       location     address start_date start_date_diff Group diff14 Group2
#         <char>      <char>     <Date>           <int> <int>  <int>  <int>
#  1: Facility A 123 main st 2022-02-07               0     1      0      1
#  2: Facility A 123 main st 2022-02-11               4     1      0      1
#  3: Facility A 123 main st 2022-02-11               0     1      0      1
#  4: Facility A 123 main st 2022-02-11               0     1      0      1
#  5: Facility A 123 main st 2022-02-12               1     1      0      1
#  6: Facility A 123 main st 2022-03-12              28     2      1      2
#  7: Facility A 123 main st 2022-03-17               5     2      1      2
#  8: Facility B  55 ford rd 2022-03-16               0     3      0      3
#  9: Facility B  55 ford rd 2022-03-16               0     3      0      3
# 10: Facility C  1 step ave 2022-03-16               0     4      0      4
# 11: Facility C  1 step ave 2022-03-20               4     4      0      4
# 12: Facility C  1 step ave 2022-03-22               2     4      0      4

dplyr

library(dplyr)
dat %>%
  mutate(start_date = as.Date(start_date, format = "%m/%d/%Y")) %>%
  group_by(location) %>%
  mutate(diff14 = cumsum(c(0, diff(start_date)) > 14)) %>%
  group_by(location, diff14) %>%
  mutate(Group2 = cur_group_id()) %>%
  ungroup()
# # A tibble: 12 x 7
#    location   address     start_date start_date_diff Group diff14 Group2
#    <chr>      <chr>       <date>               <int> <int>  <int>  <int>
#  1 Facility A 123 main st 2022-02-07               0     1      0      1
#  2 Facility A 123 main st 2022-02-11               4     1      0      1
#  3 Facility A 123 main st 2022-02-11               0     1      0      1
#  4 Facility A 123 main st 2022-02-11               0     1      0      1
#  5 Facility A 123 main st 2022-02-12               1     1      0      1
#  6 Facility A 123 main st 2022-03-12              28     2      1      2
#  7 Facility A 123 main st 2022-03-17               5     2      1      2
#  8 Facility B 55 ford rd  2022-03-16               0     3      0      3
#  9 Facility B 55 ford rd  2022-03-16               0     3      0      3
# 10 Facility C 1 step ave  2022-03-16               0     4      0      4
# 11 Facility C 1 step ave  2022-03-20               4     4      0      4
# 12 Facility C 1 step ave  2022-03-22               2     4      0      4

数据

read.md <- structure(list(location = c("Facility A", "Facility A", "Facility A", "Facility A", "Facility A", "Facility A", "Facility A", "Facility B", "Facility B", "Facility C", "Facility C", "Facility C"), address = c("123 main st", "123 main st", "123 main st", "123 main st", "123 main st", "123 main st", "123 main st", "55 ford rd", "55 ford rd", "1 step ave", "1 step ave", "1 step ave"), start_date = c("2/7/2022", "2/11/2022", "2/11/2022", "2/11/2022", "2/12/2022", "3/12/2022", "3/17/2022", "3/16/2022",  "3/16/2022", "3/16/2022", "3/20/2022", "3/22/2022"), start_date_diff = c(0L, 4L, 0L, 0L, 1L, 28L, 5L, 0L, 0L, 0L, 4L, 2L), Group = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L)), class = "data.frame", row.names = c(NA, -12L))

Assuming we don't already diff calculated, and that we need to convert start_date into something arithmetically useful.

data.table

library(data.table)
as.data.table(dat)[, start_date := as.Date(start_date, format = "%m/%d/%Y")
  ][, diff14 := cumsum(c(0, diff(start_date)) > 14), by = location
  ][, Group2 := rleid(location, diff14)][]
#       location     address start_date start_date_diff Group diff14 Group2
#         <char>      <char>     <Date>           <int> <int>  <int>  <int>
#  1: Facility A 123 main st 2022-02-07               0     1      0      1
#  2: Facility A 123 main st 2022-02-11               4     1      0      1
#  3: Facility A 123 main st 2022-02-11               0     1      0      1
#  4: Facility A 123 main st 2022-02-11               0     1      0      1
#  5: Facility A 123 main st 2022-02-12               1     1      0      1
#  6: Facility A 123 main st 2022-03-12              28     2      1      2
#  7: Facility A 123 main st 2022-03-17               5     2      1      2
#  8: Facility B  55 ford rd 2022-03-16               0     3      0      3
#  9: Facility B  55 ford rd 2022-03-16               0     3      0      3
# 10: Facility C  1 step ave 2022-03-16               0     4      0      4
# 11: Facility C  1 step ave 2022-03-20               4     4      0      4
# 12: Facility C  1 step ave 2022-03-22               2     4      0      4

dplyr

library(dplyr)
dat %>%
  mutate(start_date = as.Date(start_date, format = "%m/%d/%Y")) %>%
  group_by(location) %>%
  mutate(diff14 = cumsum(c(0, diff(start_date)) > 14)) %>%
  group_by(location, diff14) %>%
  mutate(Group2 = cur_group_id()) %>%
  ungroup()
# # A tibble: 12 x 7
#    location   address     start_date start_date_diff Group diff14 Group2
#    <chr>      <chr>       <date>               <int> <int>  <int>  <int>
#  1 Facility A 123 main st 2022-02-07               0     1      0      1
#  2 Facility A 123 main st 2022-02-11               4     1      0      1
#  3 Facility A 123 main st 2022-02-11               0     1      0      1
#  4 Facility A 123 main st 2022-02-11               0     1      0      1
#  5 Facility A 123 main st 2022-02-12               1     1      0      1
#  6 Facility A 123 main st 2022-03-12              28     2      1      2
#  7 Facility A 123 main st 2022-03-17               5     2      1      2
#  8 Facility B 55 ford rd  2022-03-16               0     3      0      3
#  9 Facility B 55 ford rd  2022-03-16               0     3      0      3
# 10 Facility C 1 step ave  2022-03-16               0     4      0      4
# 11 Facility C 1 step ave  2022-03-20               4     4      0      4
# 12 Facility C 1 step ave  2022-03-22               2     4      0      4

Data

read.md <- structure(list(location = c("Facility A", "Facility A", "Facility A", "Facility A", "Facility A", "Facility A", "Facility A", "Facility B", "Facility B", "Facility C", "Facility C", "Facility C"), address = c("123 main st", "123 main st", "123 main st", "123 main st", "123 main st", "123 main st", "123 main st", "55 ford rd", "55 ford rd", "1 step ave", "1 step ave", "1 step ave"), start_date = c("2/7/2022", "2/11/2022", "2/11/2022", "2/11/2022", "2/12/2022", "3/12/2022", "3/17/2022", "3/16/2022",  "3/16/2022", "3/16/2022", "3/20/2022", "3/22/2022"), start_date_diff = c(0L, 4L, 0L, 0L, 1L, 28L, 5L, 0L, 0L, 0L, 4L, 2L), Group = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L)), class = "data.frame", row.names = c(NA, -12L))

回复收藏 0 原文

~没有更多了~