How to join tibbles/data frames with different numbers of rows using an ID and a date/time interval?

Posted on 2025-01-17 20:48:42

I have these two datasets exemplified below:

library(lubridate)
library(tidyverse)

#dataset 1

id <- c("A_1", "A_1", "A_1", "A_1", "A_1", "A_2", "A_2", "A_2", "A_2", 
        "A_2", "B_1", "B_1", "B_1", "B_1", "B_1", "B_2", "B_2", "B_2", "B_2", 
        "B_2")
date <- ymd_hms(c("2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00"))    

df <- tibble(id, date)

# A tibble: 20 x 2
   id    date               
   <chr> <dttm>             
 1 A_1   2017-11-26 09:00:00
 2 A_1   2017-11-26 09:05:00
 3 A_1   2017-11-30 09:00:00
 4 A_1   2017-11-30 09:05:00
 5 A_1   2017-12-02 09:00:00
 6 A_2   2017-11-26 09:00:00
 7 A_2   2017-11-26 09:05:00
 8 A_2   2017-11-30 09:00:00
 9 A_2   2017-11-30 09:05:00
10 A_2   2017-12-02 09:00:00
11 B_1   2017-11-26 09:00:00
12 B_1   2017-11-26 09:05:00
13 B_1   2017-11-30 09:00:00
14 B_1   2017-11-30 09:05:00
15 B_1   2017-12-02 09:00:00
16 B_2   2017-11-26 09:00:00
17 B_2   2017-11-26 09:05:00
18 B_2   2017-11-30 09:00:00
19 B_2   2017-11-30 09:05:00
20 B_2   2017-12-02 09:00:00

#dataset 2

id <- c("A", "A", "B", "B")
date <- ymd_hms(c("2017-11-26 09:01:30", "2017-11-30 09:06:40", "2017-11-30 09:04:50", "2017-12-02 09:01:00"))
variable1 <- c("67", "30", "28", "90")
variable2 <- c("x","y","z", "w")
df2 <- tibble(id, date, variable1, variable2)

# A tibble: 4 x 4
  id    date                variable1 variable2
  <chr> <dttm>              <chr>     <chr>    
1 A     2017-11-26 09:01:30 67        x        
2 A     2017-11-30 09:06:40 30        y        
3 B     2017-11-30 09:04:50 28        z        
4 B     2017-12-02 09:01:00 90        w

I first need to match on "id", then on date and time, and then pull the columns of dataset 2 into dataset 1 as new columns. Each dataset 2 row should be attached to the dataset 1 row whose time is the closest one before it (condition: at most 5 minutes earlier).
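To make the matching rule concrete, here is a minimal sketch in base R. It assumes the condition means that a dataset 2 time matches a dataset 1 time when it falls within the 5 minutes starting at that dataset 1 time. `in_window` and `width_secs` are hypothetical names used only for illustration; they are not part of the original post.

```r
# Hypothetical helper (not from the original post): TRUE when t2 falls in
# the 5 minutes starting at t1, i.e. t1 <= t2 < t1 + width_secs.
in_window <- function(t1, t2, width_secs = 300) {
  gap <- as.numeric(difftime(t2, t1, units = "secs"))
  gap >= 0 & gap < width_secs
}

# Two dataset 1 times and one dataset 2 time taken from id "B".
t1 <- as.POSIXct(c("2017-11-30 09:00:00", "2017-11-30 09:05:00"), tz = "UTC")
t2 <- as.POSIXct("2017-11-30 09:04:50", tz = "UTC")

res <- in_window(t1, t2)
print(res)  # TRUE FALSE: only the 09:00:00 row is within 5 minutes before t2
```

Whether the window should include its end point is an assumption here; the sample data does not contain a case that decides it.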

However, each "id" in dataset 2 occurs 50 times in dataset 1, so a given row will probably find a corresponding time 50 times in dataset 1 for the same date. For each "id", I need this extraction to be done as many times as there is a corresponding time, however often that is.
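As a small illustration of how one dataset 2 id maps to several dataset 1 ids, the sketch below strips the numeric suffix in base R. It assumes every dataset 1 id is a dataset 2 id followed by "_<number>", as in the sample data; `ids` and `base_id` are hypothetical names.

```r
# A few dataset 1 ids in the style of the sample data.
ids <- c("A_1", "A_2", "B_1", "B_2", "B_2")

# Assumption: the part before "_<number>" is the dataset 2 id.
base_id <- sub("_\\d+$", "", ids)
print(base_id)  # "A" "A" "B" "B" "B"

# How often each dataset 2 id occurs among the dataset 1 ids.
counts <- table(base_id)
print(counts)
```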

The resulting dataset would look like this:

df_output
# A tibble: 20 x 5
   id    date                date2               variable1 variable2
   <chr> <dttm>              <chr>               <chr>     <chr>    
 1 A_1   2017-11-26 09:00:00 2017-11-26 09:01:30 67        x        
 2 A_1   2017-11-26 09:05:00 NA                  NA        NA       
 3 A_1   2017-11-30 09:00:00 NA                  NA        NA       
 4 A_1   2017-11-30 09:05:00 2017-11-30 09:06:40 30        y        
 5 A_1   2017-12-02 09:00:00 NA                  NA        NA       
 6 A_2   2017-11-26 09:00:00 2017-11-26 09:01:30 67        x        
 7 A_2   2017-11-26 09:05:00 NA                  NA        NA       
 8 A_2   2017-11-30 09:00:00 NA                  NA        NA       
 9 A_2   2017-11-30 09:05:00 2017-11-30 09:06:40 30        y        
10 A_2   2017-12-02 09:00:00 NA                  NA        NA       
11 B_1   2017-11-26 09:00:00 NA                  NA        NA       
12 B_1   2017-11-26 09:05:00 NA                  NA        NA       
13 B_1   2017-11-30 09:00:00 2017-11-30 09:04:50 28        z        
14 B_1   2017-11-30 09:05:00 NA                  NA        NA       
15 B_1   2017-12-02 09:00:00 2017-12-02 09:01:00 90        w        
16 B_2   2017-11-26 09:00:00 NA                  NA        NA       
17 B_2   2017-11-26 09:05:00 NA                  NA        NA       
18 B_2   2017-11-30 09:00:00 2017-11-30 09:04:50 28        z        
19 B_2   2017-11-30 09:05:00 NA                  NA        NA       
20 B_2   2017-12-02 09:00:00 2017-12-02 09:01:00 90        w

Note: not all rows will have a corresponding row in dataset 2, so those rows must be filled with NA.

Thanks in advance.

哎呦我呸! 2025-01-24 20:48:42

We can use floor_date and ceiling_date from lubridate to turn each dataset 2 timestamp into a 5-minute interval, then do a non-equi join with data.table.

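library(lubridate)
library(dplyr)
library(data.table)
# 5-minute window around each df2 timestamp: date = start, date2 = end
df2new <- df2 %>%
   mutate(date2 = ceiling_date(date, "5 min"),
          date = floor_date(date, "5 min"))
# Strip the "_<number>" suffix from the df ids, then update df by reference
# with the df2new columns where df's date lies in (date, date2]
setDT(df)[, id2 := trimws(id, whitespace = "_\\d+")][
   setDT(df2new), c('date2', 'variable1', 'variable2') := .(date2,
    variable1, variable2), on = .(id2 = id, date > date, date <= date2)]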
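For comparison, here is a minimal base R sketch of the same kind of interval join, written without packages so it does not depend on a particular data.table or dplyr version. It is not the code from the answer: it assumes a match means the dataset 2 time falls within the 5 minutes starting at the dataset 1 time, and it keeps the original dataset 2 timestamp as `date2`. The data is a small subset of the question's sample (id "A" only), and `id2`, `row`, `pairs`, `hits`, and `out` are hypothetical names.

```r
# Subset of the question's data, rebuilt with base R.
df <- data.frame(
  id   = rep(c("A_1", "A_2"), each = 3),
  date = as.POSIXct(rep(c("2017-11-26 09:00:00", "2017-11-26 09:05:00",
                          "2017-11-30 09:05:00"), times = 2), tz = "UTC"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id        = c("A", "A"),
  date2     = as.POSIXct(c("2017-11-26 09:01:30", "2017-11-30 09:06:40"),
                         tz = "UTC"),
  variable1 = c("67", "30"),
  variable2 = c("x", "y"),
  stringsAsFactors = FALSE
)

# Join key: dataset 1 id without its numeric suffix ("A_1" -> "A").
df$id2 <- sub("_\\d+$", "", df$id)
df$row <- seq_len(nrow(df))  # remember the original row order

# All pairs that share an id, then keep only the pairs inside the assumed
# window: date <= date2 < date + 5 min.
pairs <- merge(df, df2, by.x = "id2", by.y = "id")
gap   <- as.numeric(difftime(pairs$date2, pairs$date, units = "secs"))
hits  <- pairs[gap >= 0 & gap < 300,
               c("row", "date2", "variable1", "variable2")]

# Left join back onto dataset 1 so unmatched rows are filled with NA.
out <- merge(df, hits, by = "row", all.x = TRUE)
out <- out[order(out$row), c("id", "date", "date2", "variable1", "variable2")]
print(out)
```

The intermediate `pairs` table holds every combination of rows that share an id, so on large data a true non-equi join (data.table's `on` with inequalities, or `join_by()` in recent dplyr) is likely the better fit. If one dataset 1 row could match several dataset 2 rows, this sketch would return one output row per match.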
