R: count the number of rows within a time interval

Posted on 2025-02-11 20:17:55


Let's assume the following data frame:

df <- tibble(ID = c(12, 12, 12, 13, 13, 13),
             times = c(as.POSIXct("2021-01-02 10:00:00"),
                       as.POSIXct("2021-01-02 11:00:00"),
                       as.POSIXct("2021-01-02 13:00:00"),
                       as.POSIXct("2021-01-02 13:00:00"),
                       as.POSIXct("2021-01-02 14:00:00"),
                       as.POSIXct("2021-01-02 15:00:00")))

        ID times              
  <dbl> <dttm>             
1    12 2021-01-02 10:00:00
2    12 2021-01-02 11:00:00
3    12 2021-01-02 13:00:00
4    13 2021-01-02 13:00:00
5    13 2021-01-02 14:00:00
6    13 2021-01-02 15:00:00

What I want is a column that treats each timestamp of an ID as a start value and counts the number of subsequent observations within the next 2 hours. So this is my goal:

     ID times               n_obs_within_2h
  <dbl> <dttm>                        <dbl>
1    12 2021-01-02 10:00:00               2
2    12 2021-01-02 11:00:00               2
3    12 2021-01-02 13:00:00               1
4    13 2021-01-02 13:00:00               3
5    13 2021-01-02 14:00:00               2
6    13 2021-01-02 15:00:00               1

I know that this could easily be done with purrr::map by iterating over each row. However, my original dataset is quite big, which makes that approach fairly inefficient. Can you think of another way, other than iterating over each row, to compute n_obs_within_2h?

EDIT: my current attempt:

df %>% group_by(ID) %>% 
  mutate(n_obs_with_2h = purrr::pmap_dbl(.l = list(ID, times), 
                                         .f = function(i, t, data) {
                                           n <- data %>%
                                             filter(ID == i) %>%
                                             filter(between(as.double.difftime(times-t, units = "hours"),
                                                            0, 2)) %>%
                                             nrow()
                                           return(n)
                                         }, data = .))

Comments (2)

源来凯始玺欢你 2025-02-18 20:17:55


Maybe a vectorised approach using a sliding window to count subsequent observations within the next 2 hours?

library(tidyverse)
library(lubridate)
library(slider)

df <- tibble(
  ID = c(12, 12, 12, 13, 13, 13),
  times = c(
    as.POSIXct("2021-01-02 10:00:00"),
    as.POSIXct("2021-01-02 11:00:00"),
    as.POSIXct("2021-01-02 13:00:00"),
    as.POSIXct("2021-01-02 13:00:00"),
    as.POSIXct("2021-01-02 14:00:00"),
    as.POSIXct("2021-01-02 15:00:00")
  )
)

df |>
  group_by(ID) |>
  mutate(
    diff = difftime(times, min(times), units = "hours"),
    within_2 = if_else(diff <= 2, 1, 0),
    n_obs_within_2h = slide_dbl(within_2, sum, .after = Inf)
  ) |> 
  ungroup()

#> # A tibble: 6 × 5
#>      ID times               diff    within_2 n_obs_within_2h
#>   <dbl> <dttm>              <drtn>     <dbl>           <dbl>
#> 1    12 2021-01-02 10:00:00 0 hours        1               2
#> 2    12 2021-01-02 11:00:00 1 hours        1               1
#> 3    12 2021-01-02 13:00:00 3 hours        0               0
#> 4    13 2021-01-02 13:00:00 0 hours        1               3
#> 5    13 2021-01-02 14:00:00 1 hours        1               2
#> 6    13 2021-01-02 15:00:00 2 hours        1               1

Created on 2022-06-30 by the reprex package (v2.0.1)
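
A possibly more direct route, sticking with slider, is to index the window by times itself so that each window spans [t, t + 2h] and no helper columns are needed. This is only a minimal sketch, not part of the original answer, and it assumes slider's time-based windows (where .after accepts a lubridate period when the index is a date-time):

library(dplyr)
library(lubridate)
library(slider)

df |>
  arrange(ID, times) |>            # slide_index_*() expects an ascending index
  group_by(ID) |>
  mutate(
    n_obs_within_2h = slide_index_dbl(
      .x = times,                  # values passed to .f (only their count matters)
      .i = times,                  # time index that defines each window
      .f = length,                 # count the observations in each window
      .after = hours(2)            # window reaches 2 hours ahead of each timestamp
    )
  ) |>
  ungroup()

On the example data this should reproduce the requested n_obs_within_2h column (2, 2, 1, 3, 2, 1), since every window starts at the row's own timestamp rather than at the group's first timestamp.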

清醇 2025-02-18 20:17:55


Using another approach within the mapping is probably the key to better performance in this case. Instead of filtering the full data, we can use the grouped structure itself, like this:

df |>
  group_by(ID) %>%
  mutate(n_obs_with_2h = purrr::map_dbl(
    times,
    # for each timestamp, count group members within the following 2 hours
    ~ sum(difftime(times[ID == ID], ., units = "hours") <= 2 &
            difftime(times[ID == ID], ., units = "hours") >= 0)
  )) %>%
  ungroup()

# A tibble: 6 × 3
#     ID times               n_obs_with_2h
#  <dbl> <dttm>                      <dbl>
#     12 2021-01-02 10:00:00             2
#     12 2021-01-02 11:00:00             2
#     12 2021-01-02 13:00:00             1
#     13 2021-01-02 13:00:00             3
#     13 2021-01-02 14:00:00             2
#     13 2021-01-02 15:00:00             1

See the benchmark (even if the provided data is too small for it to be fully reliable; that said, I would expect the gap to be even larger on a bigger set):

fun_original <- function(df) {
  
  df %>% group_by(ID) %>% 
    mutate(n_obs_with_2h = purrr::pmap_dbl(.l = list(ID, times), 
                                           .f = function(i, t, data) {
                                             n <- data %>%
                                               filter(ID == i) %>%
                                               filter(between(as.double.difftime(times-t, units = "hours"),
                                                              0, 2)) %>%
                                               nrow()
                                             return(n)
                                           }, data = .)) %>% ungroup()
  
}

fun_new <- function(df) {
  
  df |>
    group_by(ID) |>
    mutate(n_obs_with_2h = purrr::map_dbl(
      times,
      ~ sum(difftime(times[ID == ID], ., units = "hours") <= 2 &
              difftime(times[ID == ID], ., units = "hours") >= 0)
    )) |>
    ungroup() 
  
}

bench::mark(fun_original(df), fun_new(df))

# A tibble: 2 × 13
#  expression            min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result  
#  <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>  
#  fun_original(df)  15.53ms  16.38ms      59.9   45.77KB     15.6    23     6      384ms <tibble>
#  fun_new(df)        1.74ms   1.95ms     486.     6.02KB     10.8   224     5      461ms <tibble>
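
For really large data, another option that avoids per-row mapping entirely is a non-equi join. The following is only a minimal sketch with data.table, not part of either answer; the window columns start/end and the 2-hour offset expressed in seconds are my own naming choices:

library(data.table)

dt <- as.data.table(df)

# one window per row: [times, times + 2 hours]
windows <- dt[, .(ID, start = times, end = times + 2 * 3600)]

# for each window, count the rows of dt with the same ID whose times fall inside it
counts <- dt[windows,
             on = .(ID, times >= start, times <= end),
             .N,
             by = .EACHI]

dt[, n_obs_within_2h := counts$N]
dt

Every row matches at least itself, so counts has exactly one row per row of dt, in the original order, and can be assigned back directly.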