Summing cumulative time between changes of a single state variable in R

Posted on 2025-01-18 16:40:18

I've been looking for answers and messing around with my code for a couple of hours. I have a dataset that looks like the following for a specific ID:

# A tibble: 10 × 3
        ID state orderDate          
     <dbl> <chr> <dttm>             
 1 4227631 1     2022-03-14 19:00:00
 2 4227631 1     2022-03-14 20:00:00
 3 4227631 1     2022-03-15 11:00:00
 4 4227631 0     2022-03-15 11:00:00
 5 4227631 1     2022-03-15 20:00:00
 6 4227631 1     2022-03-16 04:00:00
 7 4227631 0     2022-03-16 04:00:00
 8 4227631 1     2022-03-16 05:00:00
 9 4227631 0     2022-03-16 13:00:00
10 4227631 1     2022-03-16 15:00:00

This occurs for hundreds of IDs. For this example, I am using dplyr to group_by ID. I only care about when the state changes between values, not when it stays the same.

I want to calculate the cumulative time each ID remains in state 1. Rows where state 1 is repeated several times before it changes should be ignored. I have been planning to use lubridate and dplyr to perform the analysis.

Tibble I am using for this example:

structure(list(ID = c(4227631, 4227631, 4227631, 4227631, 4227631, 
4227631, 4227631, 4227631, 4227631, 4227631), state = c("1", 
"1", "1", "0", "1", "1", "0", "1", "0", "1"), orderDate = structure(c(1647284400, 
1647288000, 1647342000, 1647342000, 1647374400, 1647403200, 1647403200, 
1647406800, 1647435600, 1647442800), tzone = "UTC", class = c("POSIXct", 
"POSIXt"))), row.names = c(NA, -10L), class = c("tbl_df", "tbl", 
"data.frame"))

I've tried various solutions, such as "Cumulative time with reset"; however, I'm having trouble with lag and with incorporating it into this specific analysis.

The expected output would maybe look something like this:

I would then sum all statusOne values to get the cumulative time spent in this state.

More elegant solutions are welcome, as is a link to a prior question that covers this.

EDIT
Using the solution below, I figured it out!
The solution didn't account for cases where state 0 immediately follows state 1, and we wanted to include the time elapsed between those states.

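library(dplyr)
library(tidyr)  # replace_na()

# Column names follow the example tibble above (state, orderDate).
df %>%
  group_by(ID) %>%
  # Run counter: increments each time the state differs from the previous row.
  mutate(max = cumsum(ifelse(state == lag(state, default = "1"), 0, 1))) %>%
  # hours1: time since the previous row while staying inside a run of state 1.
  mutate(hours1 = ifelse(max == lag(max) & state == "1",
                         difftime(orderDate, lag(orderDate), units = "hours"), NA)) %>%
  # hours2: time since the previous row when state 1 switches to state 0.
  mutate(hours2 = ifelse(state == "0" & lag(state) == "1",
                         difftime(orderDate, lag(orderDate), units = "hours"), NA)) %>%
  mutate(hours1 = replace_na(hours1, 0),
         hours2 = replace_na(hours2, 0)) %>%
  mutate(hours = hours1 + hours2) %>%
  select(-hours1, -hours2) %>%
  summarise(total_hours = sum(hours, na.rm = TRUE)) %>%
  filter(total_hours != 0)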
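
The rule described in the edit (count time inside runs of state 1, plus the time from the last state 1 row to the following state 0 row) can possibly be written more compactly by looking forward instead of backward: every row in state 1 contributes the time until the next row. The following is only a sketch under assumptions: rows are already in chronological order within each ID, and the names hours_to_next and state_one_hours are made up for illustration.

```r
library(dplyr)

# Sample data from the question (10 rows, one ID).
df <- tibble(
  ID = 4227631,
  state = c("1", "1", "1", "0", "1", "1", "0", "1", "0", "1"),
  orderDate = as.POSIXct(
    c(1647284400, 1647288000, 1647342000, 1647342000, 1647374400,
      1647403200, 1647403200, 1647406800, 1647435600, 1647442800),
    origin = "1970-01-01", tz = "UTC"
  )
)

state_one_hours <- df %>%
  group_by(ID) %>%
  mutate(
    # Hours from this row to the next row of the same ID (NA on the last row).
    # Assumes rows are already ordered by time within each ID.
    hours_to_next = as.numeric(difftime(lead(orderDate), orderDate, units = "hours"))
  ) %>%
  # Only intervals that start in state 1 count towards the total.
  summarise(total_hours = sum(hours_to_next[state == "1"], na.rm = TRUE))

state_one_hours
```

On the sample data this gives 32 hours: the 24 hours spent inside runs of state 1 plus the 8 hours between 05:00 and 13:00 on 2022-03-16. The final state 1 row has no following row, so it contributes nothing.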

巴黎盛开的樱花 2025-01-25 16:40:18

This is far from elegant, but at least it appears to provide the correct answer:

library(tidyverse)

df <- structure(list(ID = c(4227631, 4227631, 4227631, 4227631, 4227631, 
                            4227631, 4227631, 4227631, 4227631, 4227631),
                     state = c("1", "1", "1", "0", "1", "1", "0", "1", "0", "1"),
                     orderDate = structure(c(1647284400, 1647288000, 1647342000, 
                                             1647342000, 1647374400, 1647403200,
                                             1647403200, 1647406800, 1647435600, 
                                             1647442800), 
                                           tzone = "UTC",
                                           class = c("POSIXct", "POSIXt"))),
                row.names = c(NA, -10L),
                class = c("tbl_df", "tbl", "data.frame"))

df2 <- df %>%
  group_by(ID) %>%
  mutate(tmp = ifelse(state == lag(state, default = "1"), 0, 1),
         max = cumsum(tmp)) %>%
  mutate(hours = ifelse(max == lag(max), difftime(orderDate, lag(orderDate), units = "h"), NA)) %>%
  select(-tmp)

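# One row per run: total hours spent inside each run of repeated states.
# Grouping by ID as well keeps run counters from different IDs separate.
df3 <- df2 %>%
  group_by(ID, max) %>%
  summarise(statusOne = sum(hours, na.rm = TRUE), .groups = "drop")

# Attach the run totals, then keep each total only on the first row of its run.
df4 <- left_join(df2, df3, by = c("ID", "max")) %>%
  distinct() %>%
  select(-c(max, hours)) %>%
  mutate(statusOne = ifelse(statusOne != 0 & lag(statusOne, default = 1) == statusOne, 0, statusOne))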

df4
#> # A tibble: 10 × 4
#> # Groups:   ID [1]
#>         ID state orderDate           statusOne
#>      <dbl> <chr> <dttm>                  <dbl>
#>  1 4227631 1     2022-03-14 19:00:00        16
#>  2 4227631 1     2022-03-14 20:00:00         0
#>  3 4227631 1     2022-03-15 11:00:00         0
#>  4 4227631 0     2022-03-15 11:00:00         0
#>  5 4227631 1     2022-03-15 20:00:00         8
#>  6 4227631 1     2022-03-16 04:00:00         0
#>  7 4227631 0     2022-03-16 04:00:00         0
#>  8 4227631 1     2022-03-16 05:00:00         0
#>  9 4227631 0     2022-03-16 13:00:00         0
#> 10 4227631 1     2022-03-16 15:00:00         0

Created on 2022-04-04 by the reprex package (v2.0.1)
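
If only the per-run durations are needed, the intermediate join can probably be avoided by summarising each run directly. This is a rough sketch rather than a replacement for the code above; the names run, runs, start and end are illustrative, and it assumes the rows are already in chronological order within each ID.

```r
library(dplyr)

# Sample data from the question (10 rows, one ID).
df <- tibble(
  ID = 4227631,
  state = c("1", "1", "1", "0", "1", "1", "0", "1", "0", "1"),
  orderDate = as.POSIXct(
    c(1647284400, 1647288000, 1647342000, 1647342000, 1647374400,
      1647403200, 1647403200, 1647406800, 1647435600, 1647442800),
    origin = "1970-01-01", tz = "UTC"
  )
)

runs <- df %>%
  group_by(ID) %>%
  # Run id: increments whenever the state differs from the previous row.
  mutate(run = cumsum(state != lag(state, default = first(state)))) %>%
  group_by(ID, run, state) %>%
  # One row per run, with the first and last timestamp seen in that run.
  summarise(
    start = min(orderDate),
    end = max(orderDate),
    hours = as.numeric(difftime(max(orderDate), min(orderDate), units = "hours")),
    .groups = "drop"
  )

runs
```

For the sample data this yields seven runs; the runs in state 1 last 16, 8, 0 and 0 hours, which matches the non-zero statusOne values shown above.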

Edit

It's a lot more straightforward to get the total hours in state 1 for each ID:

df %>%
  group_by(ID) %>%
  mutate(max = cumsum(ifelse(state == lag(state, default = "1"), 0, 1))) %>%
  mutate(hours = ifelse(max == lag(max), difftime(orderDate, lag(orderDate), units = "h"), NA)) %>%
  summarise(total_hours = sum(hours, na.rm = TRUE))
#> # A tibble: 1 × 2
#>        ID total_hours
#>     <dbl>       <dbl>
#> 1 4227631          24

Created on 2022-04-04 by the reprex package (v2.0.1)
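
One caveat worth checking: the condition max == lag(max) does not look at the state itself, so consecutive rows in state 0 would be counted as well. The sample data never repeats state 0, which is why the total is unaffected there. The sketch below uses made-up rows (toy is not from the question) to show the difference once a state == "1" check is added; the column names run, same_run, any_state and state_one are illustrative.

```r
library(dplyr)

# Hypothetical rows (not from the question): state 0 repeats for 5 hours.
toy <- tibble(
  ID = 1,
  state = c("1", "1", "0", "0", "1"),
  orderDate = as.POSIXct("2022-03-14 00:00:00", tz = "UTC") + 3600 * c(0, 2, 2, 7, 9)
)

totals <- toy %>%
  group_by(ID) %>%
  mutate(
    # Run id: increments whenever the state differs from the previous row.
    run = cumsum(state != lag(state, default = first(state))),
    # TRUE when this row continues the run of the previous row.
    same_run = run == lag(run, default = -1L),
    hours = as.numeric(difftime(orderDate, lag(orderDate), units = "hours"))
  ) %>%
  summarise(
    any_state = sum(hours[same_run], na.rm = TRUE),                 # no state check
    state_one = sum(hours[same_run & state == "1"], na.rm = TRUE)   # state 1 only
  )

totals
```

Here any_state is 7 (2 hours in state 1 plus 5 hours in state 0), while state_one is 2.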
