R: Count the number of rows within a time interval
Let's assume the following data frame:
library(dplyr)  # provides tibble() and %>%

df <- tibble(ID = c(12, 12, 12, 13, 13, 13),
             times = c(as.POSIXct("2021-01-02 10:00:00"),
                       as.POSIXct("2021-01-02 11:00:00"),
                       as.POSIXct("2021-01-02 13:00:00"),
                       as.POSIXct("2021-01-02 13:00:00"),
                       as.POSIXct("2021-01-02 14:00:00"),
                       as.POSIXct("2021-01-02 15:00:00")))
     ID times
  <dbl> <dttm>
1    12 2021-01-02 10:00:00
2    12 2021-01-02 11:00:00
3    12 2021-01-02 13:00:00
4    13 2021-01-02 13:00:00
5    13 2021-01-02 14:00:00
6    13 2021-01-02 15:00:00
What I want is a column that treats each timestamp of an ID as a start value and counts the observations of that ID that fall within the following 2 hours (the starting observation included). So this is my goal:
     ID times               n_obs_within_2h
  <dbl> <dttm>                        <dbl>
1    12 2021-01-02 10:00:00               2
2    12 2021-01-02 11:00:00               2
3    12 2021-01-02 13:00:00               1
4    13 2021-01-02 13:00:00               3
5    13 2021-01-02 14:00:00               2
6    13 2021-01-02 15:00:00               1
I know that this could easily be done with purrr::map by iterating over each row. However, my original dataset is quite big, which makes that fairly inefficient. Can you think of another way than iterating over each row to compute n_obs_within_2h?
EDIT: my current attempt:
df %>%
  group_by(ID) %>%
  mutate(n_obs_within_2h = purrr::pmap_dbl(
    .l = list(ID, times),
    .f = function(i, t, data) {
      # count the rows of this ID whose timestamp lies 0 to 2 hours after t
      n <- data %>%
        filter(ID == i) %>%
        filter(between(as.double.difftime(times - t, units = "hours"),
                       0, 2)) %>%
        nrow()
      return(n)
    },
    data = .))
Comments (2)
Maybe a vectorised approach using a sliding window to count subsequent observations within the next 2 hours?
Created on 2022-06-30 by the reprex package (v2.0.1)
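A minimal sketch of how such a vectorised window count could look. It is not the code from the comment above: it swaps in base R's findInterval() on the sorted timestamps of each group, and the helper name count_within and its window_secs parameter are hypothetical. The window is assumed to be closed at both ends and to include the starting observation, which is what the expected output in the question implies.

```r
library(dplyr)

df <- tibble(
  ID = c(12, 12, 12, 13, 13, 13),
  times = as.POSIXct(c("2021-01-02 10:00:00", "2021-01-02 11:00:00",
                       "2021-01-02 13:00:00", "2021-01-02 13:00:00",
                       "2021-01-02 14:00:00", "2021-01-02 15:00:00"),
                     tz = "UTC")
)

# Hypothetical helper: for every timestamp t, count the observations of the
# same vector that fall in the closed window [t, t + window_secs].
count_within <- function(times, window_secs = 2 * 60 * 60) {
  x <- as.numeric(times)   # seconds since the epoch
  ord <- order(x)          # findInterval() needs a sorted vector
  xs <- x[ord]
  upper <- findInterval(xs + window_secs, xs)      # number of obs <= t + window
  lower <- findInterval(xs, xs, left.open = TRUE)  # number of obs <  t
  out <- numeric(length(x))
  out[ord] <- upper - lower                        # restore the original row order
  out
}

result <- df %>%
  group_by(ID) %>%
  mutate(n_obs_within_2h = count_within(times)) %>%
  ungroup()
```

Because findInterval() does a binary search per element, the work per group grows roughly as n log n rather than with the n-squared comparisons of a row-by-row filter, which is where the gain on a large data set would come from.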
Using a different approach within the mapping is probably the key to better performance in this case. Instead of filtering the full data, we could use the grouped structure itself, like this:
See the benchmark. (The provided data is too small for this to be reliable; that said, I would expect it to be even faster on a bigger data set.)
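The grouped approach described above could look roughly like the following sketch. It is an illustration under assumptions, not the comment's original code: the helper name count_in_group and its window_secs parameter are hypothetical, and timestamps are converted to seconds so that the comparison does not depend on how purrr handles date-time elements.

```r
library(dplyr)
library(purrr)

df <- tibble(
  ID = c(12, 12, 12, 13, 13, 13),
  times = as.POSIXct(c("2021-01-02 10:00:00", "2021-01-02 11:00:00",
                       "2021-01-02 13:00:00", "2021-01-02 13:00:00",
                       "2021-01-02 14:00:00", "2021-01-02 15:00:00"),
                     tz = "UTC")
)

# Hypothetical helper: still maps over every timestamp, but compares it only
# against the timestamps of the current group instead of filtering the full data.
count_in_group <- function(times, window_secs = 2 * 60 * 60) {
  secs <- as.numeric(times)  # seconds since the epoch
  map_dbl(secs, function(t) sum(secs >= t & secs <= t + window_secs))
}

result <- df %>%
  group_by(ID) %>%
  # inside a grouped mutate(), `times` holds only the rows of the current ID
  mutate(n_obs_within_2h = count_in_group(times)) %>%
  ungroup()
```

This keeps the per-row iteration of the original attempt but drops the two filter() calls on the whole data frame for every row, so each comparison touches only the rows of one ID.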