How to join tibbles/data frames with different numbers of rows using an ID and a date/time interval?
I have the two example datasets shown below:
library(lubridate)
library(tidyverse)
#dataset 1
id <- c("A_1", "A_1", "A_1", "A_1", "A_1", "A_2", "A_2", "A_2", "A_2",
"A_2", "B_1", "B_1", "B_1", "B_1", "B_1", "B_2", "B_2", "B_2", "B_2",
"B_2")
date <- ymd_hms(c("2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00",
"2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00",
"2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00",
"2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00",
"2017-11-30 09:05:00", "2017-12-02 09:00:00"))
df <- tibble(id, date)
# A tibble: 20 x 2
id date
<chr> <dttm>
1 A_1 2017-11-26 09:00:00
2 A_1 2017-11-26 09:05:00
3 A_1 2017-11-30 09:00:00
4 A_1 2017-11-30 09:05:00
5 A_1 2017-12-02 09:00:00
6 A_2 2017-11-26 09:00:00
7 A_2 2017-11-26 09:05:00
8 A_2 2017-11-30 09:00:00
9 A_2 2017-11-30 09:05:00
10 A_2 2017-12-02 09:00:00
11 B_1 2017-11-26 09:00:00
12 B_1 2017-11-26 09:05:00
13 B_1 2017-11-30 09:00:00
14 B_1 2017-11-30 09:05:00
15 B_1 2017-12-02 09:00:00
16 B_2 2017-11-26 09:00:00
17 B_2 2017-11-26 09:05:00
18 B_2 2017-11-30 09:00:00
19 B_2 2017-11-30 09:05:00
20 B_2 2017-12-02 09:00:00
#dataset 2
id <- c("A", "A", "B", "B")
date <- ymd_hms(c("2017-11-26 09:01:30", "2017-11-30 09:06:40", "2017-11-30 09:04:50", "2017-12-02 09:01:00"))
variable1 <- c("67", "30", "28", "90")
variable2 <- c("x","y","z", "w")
df2 <- tibble(id, date, variable1, variable2)
# A tibble: 4 x 4
id date variable1 variable2
<chr> <dttm> <chr> <chr>
1 A 2017-11-26 09:01:30 67 x
2 A 2017-11-30 09:06:40 30 y
3 B 2017-11-30 09:04:50 28 z
4 B 2017-12-02 09:01:00 90 w
I need to match rows first by "id" and then by date and time, and then pull the columns of dataset 2 into dataset 1 as new columns, attaching each dataset 2 row to the nearest earlier time in dataset 1 (condition: each row joins to the previous time, at most 5 minutes before).
However, each "id" in dataset 2 occurs 50 times in dataset 1. A given row will therefore probably find a corresponding time 50 times in dataset 1 for the same date. For each "id", I need this "extraction" to be done as many times as there is a corresponding time, however often that is.
The resulting dataset would look like this:
df_output
# A tibble: 20 x 5
id date date2 variable1 variable2
<chr> <dttm> <chr> <chr> <chr>
1 A_1 2017-11-26 09:00:00 2017-11-26 09:01:30 67 x
2 A_1 2017-11-26 09:05:00 NA NA NA
3 A_1 2017-11-30 09:00:00 NA NA NA
4 A_1 2017-11-30 09:05:00 2017-11-30 09:06:40 30 y
5 A_1 2017-12-02 09:00:00 NA NA NA
6 A_2 2017-11-26 09:00:00 2017-11-26 09:01:30 67 x
7 A_2 2017-11-26 09:05:00 NA NA NA
8 A_2 2017-11-30 09:00:00 NA NA NA
9 A_2 2017-11-30 09:05:00 2017-11-30 09:06:40 30 y
10 A_2 2017-12-02 09:00:00 NA NA NA
11 B_1 2017-11-26 09:00:00 NA NA NA
12 B_1 2017-11-26 09:05:00 NA NA NA
13 B_1 2017-11-30 09:00:00 2017-11-30 09:04:50 28 z
14 B_1 2017-11-30 09:05:00 NA NA NA
15 B_1 2017-12-02 09:00:00 2017-12-02 09:01:00 90 w
16 B_2 2017-11-26 09:00:00 NA NA NA
17 B_2 2017-11-26 09:05:00 NA NA NA
18 B_2 2017-11-30 09:00:00 2017-11-30 09:04:50 28 z
19 B_2 2017-11-30 09:05:00 NA NA NA
20 B_2 2017-12-02 09:00:00 2017-12-02 09:01:00 90 w
Note: not all rows in dataset 1 will have a match in dataset 2, so the new columns for those rows must be filled with NA.
Thanks in advance.
Comments (1)
We may use ceiling_date from lubridate to change the date to a '5 min' interval, then do a non-equi join with data.table.
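The answer's own code did not survive on this page, so the following is only a minimal sketch of the approach it describes (round with ceiling_date, then a non-equi join in data.table), not the answerer's original code. The helper columns id_key, bin_end and bin_start are names made up for illustration. The sketch also assumes that "A_1" and "A_2" in dataset 1 should both match "A" in dataset 2, and that a dataset 2 timestamp falling exactly on a 5-minute boundary belongs to the bin that starts at that boundary.

```r
library(lubridate)
library(data.table)

# Dataset 1: the same five timestamps repeated for each of the four ids
stamps <- ymd_hms(c("2017-11-26 09:00:00", "2017-11-26 09:05:00",
                    "2017-11-30 09:00:00", "2017-11-30 09:05:00",
                    "2017-12-02 09:00:00"))
dt1 <- data.table(id   = rep(c("A_1", "A_2", "B_1", "B_2"), each = 5),
                  date = rep(stamps, times = 4))

# Dataset 2: its timestamp is stored as date2 so both dates survive the join
dt2 <- data.table(
  id        = c("A", "A", "B", "B"),
  date2     = ymd_hms(c("2017-11-26 09:01:30", "2017-11-30 09:06:40",
                        "2017-11-30 09:04:50", "2017-12-02 09:01:00")),
  variable1 = c("67", "30", "28", "90"),
  variable2 = c("x", "y", "z", "w")
)

# Assumption: strip the "_1"/"_2" suffix so "A_1" and "A_2" both map to "A"
dt1[, id_key := sub("_.*$", "", id)]

# End of the 5-minute bin that contains each dataset 2 timestamp;
# change_on_boundary = TRUE pushes exact boundaries up to the next bin end
dt2[, bin_end := ceiling_date(date2, "5 min", change_on_boundary = TRUE)]
# Start of that bin: 5 minutes (300 seconds) earlier
dt2[, bin_start := bin_end - 5 * 60]

# Non-equi join: keep every dataset 1 row, and attach a dataset 2 row when
# the ids match and bin_start <= date <= date2. Unmatched rows get NA.
res <- dt2[dt1,
           .(id = i.id, date = i.date, date2 = x.date2,
             variable1 = x.variable1, variable2 = x.variable2),
           on = .(id = id_key, bin_start <= date, date2 >= date)]

print(res)
```

Because dt1 is the table passed inside the brackets, the result keeps all 20 rows of dataset 1 in their original order. If an exact boundary should instead match the earlier bin, dropping change_on_boundary and adjusting the inequalities would be one way to change that.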