在总结之前，在日期变量和描述性变量上使用group_by

发布于 2025-01-20 04:16:26 字数 2659 浏览 4 评论 0原文

我有一个结构如下的数据框：

# A tibble: 2,095,377 x 3
   Date       SalesPrice distance_ord
   <date>          <dbl> <fct>       
 1 1977-07-25      32000 Long        
 2 1981-08-31     200000 Long        
 3 1985-05-01     270000 Moderate    
 4 1987-06-01      20000 Short       
 5 1989-07-13    1400000 Short       
 6 1992-06-01      26000 Long        
 7 1993-06-15     159000 Very Long   
 8 1993-06-24     165000 Short       
 9 1993-05-24     215000 Very Long   
10 1991-05-21     248000 Moderate

我试图通过 Date 信息将 distance_ord 信息分组在一起，然后计算平均值 SalesPrices对于该日期上的每个distance_ord。

我尝试了这段代码：

testing$Date <- ymd(testing$Date)

testing.data <- testing %>%
  group_by(Date, distance_ord) %>%
  mutate(averged.prices = mean(SalesPrice)) %>%
  filter(year(Date) >= 2010)

但是，正如您在下面的小标题中看到的，一些日期在 distance_ord 分组中重复。例如，“中等”distance_ord 具有多个 2010-01-05 和 2010-01-04 条目。

as_tibble(testing.data)
# A tibble: 678,032 x 4
   Date       SalesPrice distance_ord averged.prices
   <date>          <dbl> <fct>                 <dbl>
 1 2010-01-05     317000 Moderate            476710.
 2 2010-01-04     226950 Moderate            489254.
 3 2010-01-04     160000 Short               309123.
 4 2010-01-05       1000 Very Long           759615.
 5 2010-01-04     160000 Moderate            489254.
 6 2010-01-05     600000 Moderate            476710.
 7 2010-01-05     600000 Moderate            476710.
 8 2010-01-04    1710000 Long                463314.
 9 2010-01-04     330000 Very Long           402171.
10 2010-01-02       9140 Long                393836 
# ... with 678,022 more rows

这看起来是一件很简单的事情，但我无法弄清楚发生了什么。 Date 列是否可能存在问题导致此问题？每个Date 对于每个distance_ord 变量只能显示一次。

这是数据的可重现示例：

structure(list(Date = structure(c(2762, 4260, 5599, 6360, 7133, 
8187, 8566, 8575, 8544, 7810, 8604, 8617, 8617, 8561, 8614, 8601, 
8576, 8538, 8601, 8617), class = "Date"), SalesPrice = c(32000, 
2e+05, 270000, 20000, 1400000, 26000, 159000, 165000, 215000, 
248000, 202500, 389046, 177204, 855000, 290000, 275000, 85000, 
117000, 130000, 704900), distance_ord = structure(c(1L, 1L, 2L, 
3L, 3L, 1L, 4L, 3L, 4L, 2L, 4L, 2L, 1L, 1L, 2L, 3L, 3L, 2L, 1L, 
1L), .Label = c("Long", "Moderate", "Short", "Very Long"), class = "factor")), .internal.selfref = <pointer: (nil)>, row.names = c(NA, 
20L), class = c("data.table", "data.frame"))

原文

I have a dataframe that is structured like this:

# A tibble: 2,095,377 x 3
   Date       SalesPrice distance_ord
   <date>          <dbl> <fct>       
 1 1977-07-25      32000 Long        
 2 1981-08-31     200000 Long        
 3 1985-05-01     270000 Moderate    
 4 1987-06-01      20000 Short       
 5 1989-07-13    1400000 Short       
 6 1992-06-01      26000 Long        
 7 1993-06-15     159000 Very Long   
 8 1993-06-24     165000 Short       
 9 1993-05-24     215000 Very Long   
10 1991-05-21     248000 Moderate

I am attempting to group the distance_ord information together by the Date information and then calculate the mean SalesPrices for each distance_ord on that Date.

I have attempted this code:

testing$Date <- ymd(testing$Date)

testing.data <- testing %>%
  group_by(Date, distance_ord) %>%
  mutate(averged.prices = mean(SalesPrice)) %>%
  filter(year(Date) >= 2010)

However, as you can see in the below tibble, some dates are being duplicated across the distance_ord grouping. For example, the "moderate" distance_ord has multiple entries for 2010-01-05 and 2010-01-04.

as_tibble(testing.data)
# A tibble: 678,032 x 4
   Date       SalesPrice distance_ord averged.prices
   <date>          <dbl> <fct>                 <dbl>
 1 2010-01-05     317000 Moderate            476710.
 2 2010-01-04     226950 Moderate            489254.
 3 2010-01-04     160000 Short               309123.
 4 2010-01-05       1000 Very Long           759615.
 5 2010-01-04     160000 Moderate            489254.
 6 2010-01-05     600000 Moderate            476710.
 7 2010-01-05     600000 Moderate            476710.
 8 2010-01-04    1710000 Long                463314.
 9 2010-01-04     330000 Very Long           402171.
10 2010-01-02       9140 Long                393836 
# ... with 678,022 more rows

This seems like such a simple thing to do, but I cannot figure out what is happening. Is there perhaps something wrong with the Date column that is causing this? Each Date should only show up once for each distance_ord variable.

Here is a reproducible example of the data:

structure(list(Date = structure(c(2762, 4260, 5599, 6360, 7133, 
8187, 8566, 8575, 8544, 7810, 8604, 8617, 8617, 8561, 8614, 8601, 
8576, 8538, 8601, 8617), class = "Date"), SalesPrice = c(32000, 
2e+05, 270000, 20000, 1400000, 26000, 159000, 165000, 215000, 
248000, 202500, 389046, 177204, 855000, 290000, 275000, 85000, 
117000, 130000, 704900), distance_ord = structure(c(1L, 1L, 2L, 
3L, 3L, 1L, 4L, 3L, 4L, 2L, 4L, 2L, 1L, 1L, 2L, 3L, 3L, 2L, 1L, 
1L), .Label = c("Long", "Moderate", "Short", "Very Long"), class = "factor")), .internal.selfref = <pointer: (nil)>, row.names = c(NA, 
20L), class = c("data.table", "data.frame"))

分享到QQ

分享到微博