How to impute data in R based on subgroup and date

Posted on 2025-01-12 05:28:11

I have a dataset with meteorological information. It includes the following variables:

  • DATE: timestamp of the measurement, taken every hour throughout 2020
  • STATION: code of the meteorological station where the information was gathered
  • CITY: name of the city where the station is located
  • 10 numerical variables such as temperature, precipitation, wind speed, etc.

The problem is that not all stations measure the same variables, so there are a lot of missing values. Also, even in stations that have a given sensor, not all dates are recorded, because the sensor may be offline or fail on some dates.

What I want (but do not know how to achieve) is this: given some missing data in a station, fill the missing values with the mean of the values from the other stations IN THE SAME CITY at the same date and time.

For example, consider this minimal code example:

library(tibble)

# 4 stations (A, B in LONDON; C, D in PARIS), 5 timestamps each
df <- tibble(
  DATE = lubridate::ymd_hm(rep(c('2020-01-01 12:00', '2020-01-02 12:00', '2020-01-03 12:00', '2020-01-04 12:00', '2020-01-05 12:00'), 4)),
  STATION = rep(c('A', 'B', 'C', 'D'), each = 5),
  CITY = rep(c('LONDON', 'PARIS'), each = 10),
  var = c(NA, NA, NA, 4, 5, 6, 7, 8, NA, 10, NA, NA, 13, 14, 15, 16, 17, 18, 19, 20)
)

The first observation belongs to STATION A in CITY LONDON and is missing. But STATION B is also in LONDON, and on that day it has the value 6, so I would expect the missing value to be imputed with 6.

Row 9 belongs to STATION B in CITY LONDON and is missing. But STATION A has the value 4 on that day, so I would expect that missing value to be imputed with 4. And so on. The expected output would be:

   DATE                STATION CITY     var
   <dttm>              <chr>   <chr>  <dbl>
 1 2020-01-01 12:00:00 A       LONDON     6
 2 2020-01-02 12:00:00 A       LONDON     7
 3 2020-01-03 12:00:00 A       LONDON     8
 4 2020-01-04 12:00:00 A       LONDON     4
 5 2020-01-05 12:00:00 A       LONDON     5
 6 2020-01-01 12:00:00 B       LONDON     6
 7 2020-01-02 12:00:00 B       LONDON     7
 8 2020-01-03 12:00:00 B       LONDON     8
 9 2020-01-04 12:00:00 B       LONDON     4
10 2020-01-05 12:00:00 B       LONDON    10
11 2020-01-01 12:00:00 C       PARIS     16
12 2020-01-02 12:00:00 C       PARIS     17
13 2020-01-03 12:00:00 C       PARIS     13
14 2020-01-04 12:00:00 C       PARIS     14
15 2020-01-05 12:00:00 C       PARIS     15
16 2020-01-01 12:00:00 D       PARIS     16
17 2020-01-02 12:00:00 D       PARIS     17
18 2020-01-03 12:00:00 D       PARIS     18
19 2020-01-04 12:00:00 D       PARIS     19
20 2020-01-05 12:00:00 D       PARIS     20

Be aware that in reality there are 5 different cities with 4 different stations each, with observations taken every hour.

Comments (1)

瑕疵 2025-01-19 05:28:11
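library(dplyr)

left_join(
  df,
  # mean of the non-missing values for each DATE/CITY combination
  df %>%
    group_by(DATE, CITY) %>%
    summarize(city_mean = mean(var, na.rm = TRUE), .groups = "drop"),
  by = c("DATE", "CITY")
) %>%
  # replace only the missing values with the city mean
  mutate(var = if_else(is.na(var), city_mean, var)) %>%
  select(!city_mean)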

If you don't like the join approach, you can also do this:

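# Row by row: look up the same CITY and DATE in the full data frame.
# Simple, but likely slow on a full year of hourly data.
df %>%
  rowwise() %>%
  mutate(var = if_else(
    is.na(var),
    mean(df[df$CITY == CITY & df$DATE == DATE, ]$var, na.rm = TRUE),
    var)) %>%
  ungroup()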

Output (under either approach):

   DATE                STATION CITY     var
   <dttm>              <chr>   <chr>  <dbl>
 1 2020-01-01 12:00:00 A       LONDON     6
 2 2020-01-02 12:00:00 A       LONDON     7
 3 2020-01-03 12:00:00 A       LONDON     8
 4 2020-01-04 12:00:00 A       LONDON     4
 5 2020-01-05 12:00:00 A       LONDON     5
 6 2020-01-01 12:00:00 B       LONDON     6
 7 2020-01-02 12:00:00 B       LONDON     7
 8 2020-01-03 12:00:00 B       LONDON     8
 9 2020-01-04 12:00:00 B       LONDON     4
10 2020-01-05 12:00:00 B       LONDON    10
11 2020-01-01 12:00:00 C       PARIS     16
12 2020-01-02 12:00:00 C       PARIS     17
13 2020-01-03 12:00:00 C       PARIS     13
14 2020-01-04 12:00:00 C       PARIS     14
15 2020-01-05 12:00:00 C       PARIS     15
16 2020-01-01 12:00:00 D       PARIS     16
17 2020-01-02 12:00:00 D       PARIS     17
18 2020-01-03 12:00:00 D       PARIS     18
19 2020-01-04 12:00:00 D       PARIS     19
20 2020-01-05 12:00:00 D       PARIS     20
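
Since the real data has around 10 numeric variables, the same idea can probably be applied to all of them at once with a grouped mutate() and across(), without a join. The following is only a sketch under some assumptions: var2 is a made-up second column added for illustration, fill_with_group_mean is a hypothetical helper name, and every numeric column is assumed to be a measurement that should be imputed. It also handles one edge case the approaches above leave open: when every station in a city is missing at a given timestamp, mean(..., na.rm = TRUE) returns NaN, so the helper turns that back into NA.

```r
library(dplyr)
library(tibble)

# Toy data from the question, plus a hypothetical second variable `var2`
# in which both LONDON stations are missing on 2020-01-01.
df <- tibble(
  DATE = as.POSIXct(rep(c("2020-01-01 12:00", "2020-01-02 12:00",
                          "2020-01-03 12:00", "2020-01-04 12:00",
                          "2020-01-05 12:00"), 4), tz = "UTC"),
  STATION = rep(c("A", "B", "C", "D"), each = 5),
  CITY = rep(c("LONDON", "PARIS"), each = 10),
  var = c(NA, NA, NA, 4, 5, 6, 7, 8, NA, 10,
          NA, NA, 13, 14, 15, 16, 17, 18, 19, 20),
  var2 = c(NA, 1, 2, 3, 4, NA, 3, NA, 5, 6, rep(1, 10))
)

# Hypothetical helper: replace NAs in x with the mean of the non-missing
# values of x. Called once per DATE/CITY group by the grouped mutate() below.
fill_with_group_mean <- function(x) {
  m <- mean(x, na.rm = TRUE)    # NaN when every value in the group is NA
  if (is.nan(m)) m <- NA_real_  # nothing to impute from, so keep it missing
  if_else(is.na(x), m, x)
}

imputed <- df %>%
  group_by(DATE, CITY) %>%
  # grouping columns are skipped automatically; DATE is not numeric anyway
  mutate(across(where(is.numeric), fill_with_group_mean)) %>%
  ungroup()

print(imputed)
```

For a missing value, the group mean with na.rm = TRUE is the same as the mean of the other stations in that city, because the station's own value does not contribute. Values that are not missing are left untouched.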