How to add percentile (quantile) values as columns in a data frame

Posted on 2025-01-09 21:54:09

My data set has flow rate measurements of a river for every day of the year from 2009 to 2021. This is split up into seasons: Winter (December, Jan, Feb), Spring (March, April, May), Summer (June, July, August) and Autumn (September, October, November).

This is a sample of my data set:

> (chitt_brook_wylye_2)
# A tibble: 4,437 x 7
   river           year season month    date                flow_rate quality
   <chr>          <dbl> <chr>  <chr>    <dttm>                  <dbl> <chr>  
 1 chittern_brook  2009 Winter December 2009-12-01 00:00:00     0.059 Good   
 2 chittern_brook  2009 Winter December 2009-12-02 00:00:00     0.061 Good   
 3 chittern_brook  2009 Winter December 2009-12-03 00:00:00     0.064 Good   
 4 chittern_brook  2009 Winter December 2009-12-04 00:00:00     0.068 Good   
 5 chittern_brook  2009 Winter December 2009-12-05 00:00:00     0.076 Good   
 6 chittern_brook  2009 Winter December 2009-12-06 00:00:00     0.138 Good   
 7 chittern_brook  2009 Winter December 2009-12-07 00:00:00     0.592 Good   
 8 chittern_brook  2009 Winter December 2009-12-08 00:00:00     1.04  Good   
 9 chittern_brook  2009 Winter December 2009-12-09 00:00:00     1.46  Good   
10 chittern_brook  2009 Winter December 2009-12-10 00:00:00     1.7   Good   
# ... with 4,427 more rows

I want to find the 95th percentile, 5th percentile, median and mean of each season of every year, and have the values for the 95th percentile, 5th percentile, median and mean in separate columns of a new data frame.

For example:

> (df)
# A tibble: 49 x 2
   season_label flow_rate_mean  Q95   Q5  flow_rate_median   
   <chr>            <dbl>
 1 Winter 2009     0.453        3     2        4
 2 Spring 2010     0.519        6     3        4 
 3 Summer 2010     0.0627       4     3        6
 4 Autumn 2010     0.0415       6     2        6
 5 Winter 2010     0.0622       8     3        3
 6 Spring 2011     0.188        10    3        2
 7 Summer 2011     0.0499       2     3        2
 8 Autumn 2011     0.0383       2     2        1
 9 Winter 2011     0.0461       5     2        7
10 Spring 2012     0.0925       3     2        8
# ... with 39 more rows
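
For reference, each of the four statistics can be computed on a plain numeric vector with base R. The following is only a minimal sketch using the ten flow_rate values from the sample above. The names `flow` and `stats` are illustrative, and `quantile()` is left at its default interpolation method (type 7), which may not match every percentile convention used in hydrology:

```r
# The ten flow_rate values shown in the sample rows above
flow <- c(0.059, 0.061, 0.064, 0.068, 0.076, 0.138, 0.592, 1.04, 1.46, 1.7)

# names = FALSE drops the "95%" / "5%" labels so each result is a bare number
stats <- c(
  flow_rate_mean   = mean(flow),
  Q95              = quantile(flow, probs = 0.95, names = FALSE),
  Q5               = quantile(flow, probs = 0.05, names = FALSE),
  flow_rate_median = median(flow)
)
print(stats)
```

With real measurements that contain missing values, each of these calls would probably also need `na.rm = TRUE`.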

I currently have this code, which creates the above data frame with just the first two columns, but I would like it to also include the 95th percentile, 5th percentile and median. Is this feasible in a single step, or will I need to compute them separately and then combine the results into one data frame?

df <- chitt_brook_wylye_2 %>%
  dplyr::mutate(month = as.numeric(format(date, "%m")),
                year = as.numeric(format(date, "%Y")),
                # each Dec-Feb, Mar-May, Jun-Aug, Sep-Nov block gets one id
                season_id = (12 * year + month) %/% 3) %>%
  dplyr::group_by(season_id) %>%
  # label each season with the year in which it starts
  dplyr::mutate(season_label = paste(season, min(year))) %>%
  dplyr::group_by(season_id, season_label) %>%
  dplyr::summarise(flow_rate = mean(flow_rate))

Reproducible example and code:

library(dplyr) # provides the %>% pipe used below

date <- as.Date(c("2009-12-01","2010-01-01","2010-02-01","2010-03-01","2010-04-01","2010-05-01","2010-06-01","2010-07-01","2010-08-01","2010-09-01","2010-10-01","2010-11-01","2010-12-01"))
season <- c("Winter","Winter","Winter","Spring","Spring","Spring","Summer","Summer","Summer","Autumn","Autumn","Autumn","Winter")
var <- c(1,2,3,5,5,5,7,7,7,9,9,9,10)

df <- data.frame(date, season, var) %>% # creating the data frame
  dplyr::mutate(month = as.numeric(format(date, "%m")),
                year = as.numeric(format(date, "%Y")),
                season_id = (12 * year + month) %/% 3) %>% # generating an identifier for every season that exists in the data
  dplyr::group_by(season_id) %>% # grouping by the id
  dplyr::mutate(season_label = paste(min(year), season)) %>% # label built from the season's starting year
  dplyr::group_by(season_id, season_label) %>% # season_label is kept as a group so it survives the summarise
  dplyr::summarise(var = mean(var)) # computing the mean
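
One possible way to get all four statistics in a single step is to pass several expressions to the same `summarise()` call, so no separate data frames need to be merged. The following is a sketch on the reproducible data above, not a definitive solution. The output names `season_stats`, `var_mean`, `Q95`, `Q5` and `var_median` are illustrative, the label order follows the desired output ("Winter 2009"), `quantile()` uses its default interpolation method, and `.groups = "drop"` assumes dplyr 1.0.0 or later:

```r
library(dplyr)

date <- as.Date(c("2009-12-01","2010-01-01","2010-02-01","2010-03-01","2010-04-01",
                  "2010-05-01","2010-06-01","2010-07-01","2010-08-01","2010-09-01",
                  "2010-10-01","2010-11-01","2010-12-01"))
season <- c("Winter","Winter","Winter","Spring","Spring","Spring","Summer",
            "Summer","Summer","Autumn","Autumn","Autumn","Winter")
var <- c(1,2,3,5,5,5,7,7,7,9,9,9,10)

season_stats <- data.frame(date, season, var) %>%
  mutate(month = as.numeric(format(date, "%m")),
         year = as.numeric(format(date, "%Y")),
         season_id = (12 * year + month) %/% 3) %>%
  group_by(season_id) %>%
  mutate(season_label = paste(season, min(year))) %>%
  group_by(season_id, season_label) %>%
  # One output column per statistic. The new names differ from `var`, so
  # every expression still sees the original values rather than a summary.
  summarise(var_mean   = mean(var),
            Q95        = quantile(var, probs = 0.95, names = FALSE),
            Q5         = quantile(var, probs = 0.05, names = FALSE),
            var_median = median(var),
            .groups = "drop")

print(season_stats)
```

The distinct output names matter: if the first expression were written as `var = mean(var)`, the later expressions in the same `summarise()` call would operate on that already summarised value. For the real data, `var` would be replaced by `flow_rate`, and `na.rm = TRUE` may be needed if any measurements are missing.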
