How to impute data by subgroup and date in R

Posted 2025-01-12 05:28:11

I have a dataset with meteorological information. It includes the variables:

DATE: hourly timestamps covering the year 2020
STATION: code of the meteorological station where the measurement was taken
CITY: name of the city where the station is located
10 numerical variables such as temperature, precipitation, wind speed, etc.

The problem is that not all stations measure the same variables, so there are a lot of missing values. Also, even in stations that have a given sensor, not every timestamp is recorded, because the sensor may be offline or fail on some dates.

What I want (but do not know how to achieve) is this: given a missing value at a station, fill it with the mean of the values recorded at the same time by the other stations IN THE SAME CITY.

So for example consider this minimal code example:

library(dplyr)  # re-exports tibble() and provides the verbs used in the answer

df <- tibble(
  DATE = lubridate::ymd_hm(rep(c('2020-01-01 12:00', '2020-01-02 12:00','2020-01-03 12:00','2020-01-04 12:00','2020-01-05 12:00'), 4)),
  STATION = rep(c('A', 'B', 'C', 'D'), each=5),
  CITY = rep(c('LONDON', 'PARIS'), each=10),
  var = c(NA,NA,NA,4,5,6,7,8,NA,10,NA,NA,13,14,15,16,17,18,19,20)
)

The first observation belongs to STATION A in LONDON and is missing. STATION B is also in LONDON and has the value 6 on that day, so I would expect the missing value to be imputed as 6.

Row 9 belongs to STATION B in LONDON and is missing. STATION A has the value 4 on that day, so I would expect that missing value to be imputed as 4. And so on; the expected output would be:

   DATE                STATION CITY     var
   <dttm>              <chr>   <chr>  <dbl>
 1 2020-01-01 12:00:00 A       LONDON     6
 2 2020-01-02 12:00:00 A       LONDON     7
 3 2020-01-03 12:00:00 A       LONDON     8
 4 2020-01-04 12:00:00 A       LONDON     4
 5 2020-01-05 12:00:00 A       LONDON     5
 6 2020-01-01 12:00:00 B       LONDON     6
 7 2020-01-02 12:00:00 B       LONDON     7
 8 2020-01-03 12:00:00 B       LONDON     8
 9 2020-01-04 12:00:00 B       LONDON     4
10 2020-01-05 12:00:00 B       LONDON    10
11 2020-01-01 12:00:00 C       PARIS     16
12 2020-01-02 12:00:00 C       PARIS     17
13 2020-01-03 12:00:00 C       PARIS     13
14 2020-01-04 12:00:00 C       PARIS     14
15 2020-01-05 12:00:00 C       PARIS     15
16 2020-01-01 12:00:00 D       PARIS     16
17 2020-01-02 12:00:00 D       PARIS     17
18 2020-01-03 12:00:00 D       PARIS     18
19 2020-01-04 12:00:00 D       PARIS     19
20 2020-01-05 12:00:00 D       PARIS     20

Note that the real dataset has 5 different cities with 4 different stations each, with observations taken every hour.

瑕疵 2025-01-19 05:28:11

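library(dplyr)

# Mean of var per timestamp and city, ignoring missing values
city_means <- df %>%
  group_by(DATE, CITY) %>%
  summarize(city_mean = mean(var, na.rm = TRUE), .groups = "drop")

# Attach the city mean to every row and use it only where var is missing
left_join(df, city_means, by = c("DATE", "CITY")) %>%
  mutate(var = if_else(is.na(var), city_mean, var)) %>%
  select(!city_mean)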

If you don't like the join approach, you can also do this:

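# Row by row: replace a missing var with the mean over all rows
# that share the current row's CITY and DATE
df %>%
  rowwise() %>%
  mutate(var = if_else(
    is.na(var),
    mean(df[df$CITY == CITY & df$DATE == DATE, ]$var, na.rm = TRUE),
    var
  )) %>%
  ungroup()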

Output (under either approach):

   DATE                STATION CITY     var
   <dttm>              <chr>   <chr>  <dbl>
 1 2020-01-01 12:00:00 A       LONDON     6
 2 2020-01-02 12:00:00 A       LONDON     7
 3 2020-01-03 12:00:00 A       LONDON     8
 4 2020-01-04 12:00:00 A       LONDON     4
 5 2020-01-05 12:00:00 A       LONDON     5
 6 2020-01-01 12:00:00 B       LONDON     6
 7 2020-01-02 12:00:00 B       LONDON     7
 8 2020-01-03 12:00:00 B       LONDON     8
 9 2020-01-04 12:00:00 B       LONDON     4
10 2020-01-05 12:00:00 B       LONDON    10
11 2020-01-01 12:00:00 C       PARIS     16
12 2020-01-02 12:00:00 C       PARIS     17
13 2020-01-03 12:00:00 C       PARIS     13
14 2020-01-04 12:00:00 C       PARIS     14
15 2020-01-05 12:00:00 C       PARIS     15
16 2020-01-01 12:00:00 D       PARIS     16
17 2020-01-02 12:00:00 D       PARIS     17
18 2020-01-03 12:00:00 D       PARIS     18
19 2020-01-04 12:00:00 D       PARIS     19
20 2020-01-05 12:00:00 D       PARIS     20
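
Since the real dataset has about 10 numeric variables, the same idea can probably be applied to all of them at once with a grouped mutate() and across(). The following is only a sketch under stated assumptions: the toy data frame df2, its columns temp and wind, and the helper fill_with_group_mean() are hypothetical names made up for illustration, not part of the original question or answer. The helper leaves a value as NA when no station in that city reported at that timestamp, instead of producing NaN.

```r
library(dplyr)

# Hypothetical toy data: two measurement columns, two timestamps, four stations
df2 <- tibble(
  DATE    = rep(as.POSIXct(c("2020-01-01 12:00", "2020-01-02 12:00"), tz = "UTC"), times = 4),
  STATION = rep(c("A", "B", "C", "D"), each = 2),
  CITY    = rep(c("LONDON", "PARIS"), each = 4),
  temp    = c(NA, 2, 5, NA, 10, NA, NA, NA),
  wind    = c(1, NA, 3, 4, NA, NA, 7, 8)
)

# Hypothetical helper: within one DATE/CITY group, replace missing values
# with the mean of the non-missing ones; if every value is missing,
# return the vector unchanged so the result stays NA rather than NaN
fill_with_group_mean <- function(x) {
  if (all(is.na(x))) {
    return(x)
  }
  replace(x, is.na(x), mean(x, na.rm = TRUE))
}

imputed <- df2 %>%
  group_by(DATE, CITY) %>%
  # across() skips the grouping columns, so only temp and wind are touched
  mutate(across(where(is.numeric), fill_with_group_mean)) %>%
  ungroup()

print(imputed)
```

In this toy data neither PARIS station reports temp on 2020-01-02, so those two values remain NA; every other gap is filled from the other station in the same city at the same timestamp.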
