Counting the days of service use in each month

Posted on 2025-01-11 15:14:24

I'm currently rearranging some health service data. My data frame includes the start and end dates of service use for each individual:

id <- c("A", "A", "B")
start <- c("2018-04-01", "2019-04-02", "2018-09-01")
end <- c("2019-04-01", "2019-04-05", "2018-09-02")
df <- data.frame(id, start, end)

 id        start          end
  A    2018-04-01   2019-04-01
  A    2019-04-02   2019-04-05
  B    2018-09-01   2018-09-02

I want to do the following: (1) calculate the number of days in each month for each period of service use; (2) total the days of service use for each individual; (3) construct new columns for all possible months; and (4) generate a new data frame. The ultimate goal is to construct the following data frame:

 id  2018_Jan 2018_Feb 2018_Mar 2018_Apr 2018_May 2018_Jun ... 2018_Sep ... 2019_Sep
  A     0        0         0        30       31       31   ...     30   ...     1
  B     0        0         0         0        0        0   ...      1   ...     0

The lubridate package and a user-defined function should be helpful here. My question is similar to the post "Count the number of days in each month of a date range", which counts the number of days in each month. However, I'm not sure how to apply it to build the data frame that I want.

I would be really grateful for your help with this.

凌乱心跳 2025-01-18 15:14:24

Here's a {tidyverse} solution.

1. Use dplyr::summarize() and seq() to generate the full range of dates for each observation.
   - I use end - 1 in seq() so that the end date is not included in the count, consistent with your example.
2. Convert these to months using lubridate::floor_date(unit = "month") (technically, this changes each date to the first of its month).
3. Use dplyr::count() to tally the days per month for each id.
4. Because you want columns in your output for months with no observations, I wrote a function based on tidyr::complete() to add the unobserved months.
5. Finally, use tidyr::pivot_wider() to get a column for each month.

library(tidyverse)
library(lubridate)

complete_months <- function(.data, month, ..., fill = list()) {
  month <- pull(.data, {{ month }})
  firstday <- floor_date(min(month, na.rm = TRUE), unit = "year")
  lastday <- ceiling_date(max(month, na.rm = TRUE), unit = "year") - 1
  allmonths <- seq(firstday, lastday, by = "month")
  complete(.data, month = allmonths, ..., fill = fill)
}

month_counts <- df %>%
  mutate(across(start:end, ymd)) %>%
  group_by(id, obs = row_number()) %>%
  summarize(
    # use end - 1 in seq() to omit end date from count
    month = floor_date(seq(start, end - 1, by = 1), unit = "month"),
    .groups = "drop"
  ) %>% 
  count(month, id) %>% 
  complete_months(month, id, fill = list(n = 0)) %>% 
  mutate(month = strftime(month, "%Y_%b")) %>% 
  pivot_wider(
    names_from = month,
    values_from = n
  )

month_counts

# # A tibble: 2 x 25
#   id    `2018_Jan` `2018_Feb` `2018_Mar` `2018_Apr` `2018_May` `2018_Jun`
#   <chr>      <int>      <int>      <int>      <int>      <int>      <int>
# 1 A              0          0          0         30         31         30
# 2 B              0          0          0          0          0          0
# # ... with 18 more variables: `2018_Jul` <int>, `2018_Aug` <int>,
# #   `2018_Sep` <int>, `2018_Oct` <int>, `2018_Nov` <int>, `2018_Dec` <int>,
# #   `2019_Jan` <int>, `2019_Feb` <int>, `2019_Mar` <int>, `2019_Apr` <int>,
# #   `2019_May` <int>, `2019_Jun` <int>, `2019_Jul` <int>, `2019_Aug` <int>,
# #   `2019_Sep` <int>, `2019_Oct` <int>, `2019_Nov` <int>, `2019_Dec` <int>
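
As a cross-check on the counts above, the same steps can be sketched in base R without the tidyverse. This is only an illustration, not the answer's method: the names days, all_months and month_counts_base are made up for this sketch, and it labels columns with numeric months (%Y_%m) rather than %Y_%b so the result does not depend on the locale. Like the answer, it drops the end date via end - 1, so it assumes every row has an end date later than its start date.

```r
df <- data.frame(
  id    = c("A", "A", "B"),
  start = as.Date(c("2018-04-01", "2019-04-02", "2018-09-01")),
  end   = as.Date(c("2019-04-01", "2019-04-05", "2018-09-02"))
)

# One row per day of service use; the end date itself is excluded (end - 1).
days <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  data.frame(id = df$id[i], day = seq(df$start[i], df$end[i] - 1, by = "day"))
}))

# Every month from January of the first year to December of the last year.
first_month <- as.Date(format(min(days$day), "%Y-01-01"))
last_month  <- as.Date(format(max(days$day), "%Y-12-01"))
all_months  <- format(seq(first_month, last_month, by = "month"), "%Y_%m")

# Tabulate days per id and month; months without any days get a zero count
# because they are still present as factor levels.
days$month <- factor(format(days$day, "%Y_%m"), levels = all_months)
month_counts_base <- table(days$id, days$month)

# Wide data frame with one row per id and one column per month.
wide <- as.data.frame.matrix(month_counts_base)
```

Here wide has one row per id and 24 month columns, which can be compared against month_counts from the tidyverse pipeline.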

如梦亦如幻 2025-01-18 15:14:24

Here's one way. First, I make all combinations of id and year-month from Jan 2018 to Dec 2019. Then I summarize the data by id and year-month. Finally, I join the two datasets (to make sure the months where nothing happened are captured) and pivot wider.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
id <- c("A", "A", "B")
start <- c("2018/04/01", "2019-04-02", "2018-09-01")
end <- c("2019-04-01", "2019-04-05", "2018-09-02")
df <- data.frame(id, start, end)

all_dates <- expand.grid(id = unique(df$id), 
                         month = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), 
                         year = 2018:2019) %>% 
  mutate(yrmo = paste(year, month, sep="_")) %>% 
  select(id, yrmo)

df <- df %>% 
  mutate(start = ymd(start), 
         end = ymd(end)) %>% 
  rowwise() %>% 
  summarise(id = id, obs = 1, dates = seq(start, end, by=1)) %>% 
  mutate(yrmo = paste(year(dates), month(dates, label=TRUE, abbr=TRUE), sep="_")) %>% 
  group_by(id, yrmo) %>% 
  summarise(obs = n()) %>% 
  full_join(., all_dates) %>% 
  mutate(yrmo = factor(yrmo, levels = all_dates$yrmo[which(all_dates$id == "A")])) %>% 
  arrange(id, yrmo) %>%
  pivot_wider(names_from="yrmo", values_from="obs") %>% 
  mutate(across(everything(), ~ifelse(is.na(.x), 0, .x)))
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> Joining, by = c("id", "yrmo")

df
#> # A tibble: 2 × 25
#> # Groups:   id [2]
#>   id    `2018_Jan` `2018_Feb` `2018_Mar` `2018_Apr` `2018_May` `2018_Jun`
#>   <chr>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#> 1 A              0          0          0         30         31         30
#> 2 B              0          0          0          0          0          0
#> # … with 18 more variables: `2018_Jul` <dbl>, `2018_Aug` <dbl>,
#> #   `2018_Sep` <int>, `2018_Oct` <dbl>, `2018_Nov` <dbl>, `2018_Dec` <dbl>,
#> #   `2019_Jan` <dbl>, `2019_Feb` <dbl>, `2019_Mar` <dbl>, `2019_Apr` <dbl>,
#> #   `2019_May` <dbl>, `2019_Jun` <dbl>, `2019_Jul` <dbl>, `2019_Aug` <dbl>,
#> #   `2019_Sep` <dbl>, `2019_Oct` <dbl>, `2019_Nov` <dbl>, `2019_Dec` <dbl>

Created on 2022-03-04 by the reprex package (v2.0.1)
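
Two details of this approach are easy to get wrong, and a small base R sketch may help check them. It is an illustration under stated assumptions, not part of the answer: it builds the id and year-month grid from month.abb (R's built-in vector of English month abbreviations) instead of a hand-typed list, and it compares the day count with and without the end date. The variable names inclusive and exclusive are made up for this sketch.

```r
# All id / year-month combinations. Using month.abb means no month
# can be left out by accident when typing the list by hand.
all_dates <- expand.grid(
  id = c("A", "B"),
  month = month.abb,
  year = 2018:2019,
  stringsAsFactors = FALSE
)
all_dates$yrmo <- paste(all_dates$year, all_dates$month, sep = "_")

# seq(start, end) includes both endpoints, so B's period from
# 2018-09-01 to 2018-09-02 counts as 2 days with this approach ...
inclusive <- length(seq(as.Date("2018-09-01"), as.Date("2018-09-02"), by = "day"))

# ... and as 1 day when the end date is dropped with end - 1,
# which is what the table in the question shows for B.
exclusive <- length(seq(as.Date("2018-09-01"), as.Date("2018-09-02") - 1, by = "day"))
```

Whether the end date should count as a day of service use depends on how the data were recorded, so it is worth deciding that explicitly before comparing results from different approaches. Note also that month(dates, label = TRUE) returns labels in the current locale, so matching against month.abb assumes an English locale.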
