How to impute data by subgroup and date in R
I have a dataset with meteorological information. It includes the variables:
- DATE: measured every hour throughout 2020
- STATION: code of the meteorological station where the information was gathered
- CITY: name of the city where the station is located
- 10 numerical variables such as temperature, precipitation, wind speed, etc.
The problem is that not all stations measure the same variables, so there are a lot of missing values. Also, even at stations that have a given sensor, not every date is recorded, because the sensor may be offline or fail on some dates.
What I want (but I do not know how to achieve) is: given some missing data in a station, fill the missing values with the mean of the values from the other stations IN THE SAME CITY.
For example, consider this minimal example:
library(tibble)

df = tibble(
  DATE = lubridate::ymd_hm(rep(c('2020-01-01 12:00', '2020-01-02 12:00', '2020-01-03 12:00', '2020-01-04 12:00', '2020-01-05 12:00'), 4)),
  STATION = rep(c('A', 'B', 'C', 'D'), each = 5),
  CITY = rep(c('LONDON', 'PARIS'), each = 10),
  var = c(NA, NA, NA, 4, 5, 6, 7, 8, NA, 10, NA, NA, 13, 14, 15, 16, 17, 18, 19, 20)
)
The first observation belongs to STATION A and CITY LONDON and is missing. But STATION B is also in LONDON and on that day it has a value of 6, so I would expect that missing value to be imputed with 6.
Row 9 belongs to STATION B and CITY LONDON and is missing. But STATION A on that day has a value of 4, so I would expect that missing value to be imputed with 4. And so on; the expected output would be:
   DATE                STATION CITY     var
   <dttm>              <chr>   <chr>  <dbl>
 1 2020-01-01 12:00:00 A       LONDON     6
 2 2020-01-02 12:00:00 A       LONDON     7
 3 2020-01-03 12:00:00 A       LONDON     8
 4 2020-01-04 12:00:00 A       LONDON     4
 5 2020-01-05 12:00:00 A       LONDON     5
 6 2020-01-01 12:00:00 B       LONDON     6
 7 2020-01-02 12:00:00 B       LONDON     7
 8 2020-01-03 12:00:00 B       LONDON     8
 9 2020-01-04 12:00:00 B       LONDON     4
10 2020-01-05 12:00:00 B       LONDON    10
11 2020-01-01 12:00:00 C       PARIS     16
12 2020-01-02 12:00:00 C       PARIS     17
13 2020-01-03 12:00:00 C       PARIS     13
14 2020-01-04 12:00:00 C       PARIS     14
15 2020-01-05 12:00:00 C       PARIS     15
16 2020-01-01 12:00:00 D       PARIS     16
17 2020-01-02 12:00:00 D       PARIS     17
18 2020-01-03 12:00:00 D       PARIS     18
19 2020-01-04 12:00:00 D       PARIS     19
20 2020-01-05 12:00:00 D       PARIS     20
Be aware that in reality there are 5 different cities with 4 different stations each, with observations taken every hour.
Comments (1)
If you don't like the join approach, you can also do this:
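A minimal sketch of one non-join way to do this, assuming the idea is a grouped mutate with dplyr: group by CITY and DATE, then replace each NA with the mean of the non-missing values in its group (which, for a missing row, can only come from the other stations in that city at that time). This is an illustration rather than necessarily the answer's original code, and `imputed` is just a name picked here.

```r
library(dplyr)
library(tibble)

# Example data from the question
df <- tibble(
  DATE = lubridate::ymd_hm(rep(c('2020-01-01 12:00', '2020-01-02 12:00',
                                 '2020-01-03 12:00', '2020-01-04 12:00',
                                 '2020-01-05 12:00'), 4)),
  STATION = rep(c('A', 'B', 'C', 'D'), each = 5),
  CITY = rep(c('LONDON', 'PARIS'), each = 10),
  var = c(NA, NA, NA, 4, 5, 6, 7, 8, NA, 10, NA, NA, 13, 14, 15, 16, 17, 18, 19, 20)
)

imputed <- df %>%
  group_by(CITY, DATE) %>%                              # same city, same timestamp
  mutate(var = coalesce(var, mean(var, na.rm = TRUE))) %>%  # fill NA with the group mean
  ungroup()

imputed
```

Observed values are left untouched; only the NAs are filled. If every station in a city is missing at a given timestamp, there is nothing to average and the value stays missing. With several measurement columns, the same `coalesce()` call could probably be wrapped in `across()`, though that is not shown here.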
Output (under either approach):
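For reference, a hedged sketch of what a join-based variant could look like, assuming "the join approach" means computing the mean per CITY and DATE and joining it back onto the data. The names `city_means`, `city_mean` and `imputed_join` are made up for this illustration. On the question's example it gives the `var` values the question expects.

```r
library(dplyr)
library(tibble)

# Example data from the question
df <- tibble(
  DATE = lubridate::ymd_hm(rep(c('2020-01-01 12:00', '2020-01-02 12:00',
                                 '2020-01-03 12:00', '2020-01-04 12:00',
                                 '2020-01-05 12:00'), 4)),
  STATION = rep(c('A', 'B', 'C', 'D'), each = 5),
  CITY = rep(c('LONDON', 'PARIS'), each = 10),
  var = c(NA, NA, NA, 4, 5, 6, 7, 8, NA, 10, NA, NA, 13, 14, 15, 16, 17, 18, 19, 20)
)

# Mean of the non-missing readings for each city and timestamp
city_means <- df %>%
  group_by(CITY, DATE) %>%
  summarise(city_mean = mean(var, na.rm = TRUE), .groups = "drop")

# Join the means back and use them only where var is missing
imputed_join <- df %>%
  left_join(city_means, by = c("CITY", "DATE")) %>%
  mutate(var = coalesce(var, city_mean)) %>%
  select(-city_mean)

imputed_join$var
# 6 7 8 4 5 6 7 8 4 10 16 17 13 14 15 16 17 18 19 20
```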