Using group_by on a date variable and a descriptive variable before summarising
I have a dataframe that is structured like this:
# A tibble: 2,095,377 x 3
   Date       SalesPrice distance_ord
   <date>          <dbl> <fct>
 1 1977-07-25      32000 Long
 2 1981-08-31     200000 Long
 3 1985-05-01     270000 Moderate
 4 1987-06-01      20000 Short
 5 1989-07-13    1400000 Short
 6 1992-06-01      26000 Long
 7 1993-06-15     159000 Very Long
 8 1993-06-24     165000 Short
 9 1993-05-24     215000 Very Long
10 1991-05-21     248000 Moderate
I am trying to group the data by Date and distance_ord, and then calculate the mean SalesPrice for each distance_ord on each Date.
I have attempted this code:
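library(dplyr)
library(lubridate)
# Make sure Date is stored as a Date object
testing$Date <- ymd(testing$Date)
# Mean SalesPrice per Date and distance_ord, keeping 2010 onwards
testing.data <- testing %>%
  group_by(Date, distance_ord) %>%
  mutate(averged.prices = mean(SalesPrice)) %>%
  filter(year(Date) >= 2010)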
However, as you can see in the tibble below, some dates are repeated within a distance_ord group. For example, the "Moderate" distance_ord has multiple entries for 2010-01-05 and 2010-01-04.
as_tibble(testing.data)
# A tibble: 678,032 x 4
   Date       SalesPrice distance_ord averged.prices
   <date>          <dbl> <fct>                 <dbl>
 1 2010-01-05     317000 Moderate            476710.
 2 2010-01-04     226950 Moderate            489254.
 3 2010-01-04     160000 Short               309123.
 4 2010-01-05       1000 Very Long           759615.
 5 2010-01-04     160000 Moderate            489254.
 6 2010-01-05     600000 Moderate            476710.
 7 2010-01-05     600000 Moderate            476710.
 8 2010-01-04    1710000 Long                463314.
 9 2010-01-04     330000 Very Long           402171.
10 2010-01-02       9140 Long                393836
# ... with 678,022 more rows
This seems like such a simple thing to do, but I cannot figure out what is happening. Is there perhaps something wrong with the Date column that is causing this? Each Date should show up only once for each distance_ord level.
Here is a reproducible example of the data:
structure(list(Date = structure(c(2762, 4260, 5599, 6360, 7133,
8187, 8566, 8575, 8544, 7810, 8604, 8617, 8617, 8561, 8614, 8601,
8576, 8538, 8601, 8617), class = "Date"), SalesPrice = c(32000,
2e+05, 270000, 20000, 1400000, 26000, 159000, 165000, 215000,
248000, 202500, 389046, 177204, 855000, 290000, 275000, 85000,
117000, 130000, 704900), distance_ord = structure(c(1L, 1L, 2L,
3L, 3L, 1L, 4L, 3L, 4L, 2L, 4L, 2L, 1L, 1L, 2L, 3L, 3L, 2L, 1L,
1L), .Label = c("Long", "Moderate", "Short", "Very Long"), class = "factor")),
row.names = c(NA, 20L), class = c("data.table", "data.frame"))
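For comparison, here is a minimal sketch of what collapsing to one row per Date and distance_ord could look like. In dplyr, mutate() keeps every input row and only adds a column, whereas summarise() returns one row per group, which may be what the attempt above is missing. The names sample_sales and averaged_price are hypothetical, the data is a small made-up sample with the same columns as testing, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample with the same columns as `testing` (not the real data)
sample_sales <- data.frame(
  Date = as.Date(c("2010-01-04", "2010-01-04", "2010-01-05",
                   "2010-01-05", "2010-01-05", "2009-12-31")),
  SalesPrice = c(226950, 160000, 317000, 600000, 600000, 100000),
  distance_ord = factor(c("Moderate", "Moderate", "Moderate",
                          "Moderate", "Moderate", "Short"))
)

averaged <- sample_sales %>%
  filter(year(Date) >= 2010) %>%            # drop pre-2010 rows first
  group_by(Date, distance_ord) %>%          # one group per Date/distance_ord pair
  summarise(averaged_price = mean(SalesPrice),
            .groups = "drop")               # one row per group, ungrouped result

print(averaged)
```

With summarise() each Date appears at most once per distance_ord level, while the individual SalesPrice values are no longer part of the result. If the per-sale rows are still needed alongside the mean, mutate() is the right verb and the repeated dates are expected.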