在组中使用尾巴()使我有些困惑
我有2个dfs,基本上包含一个日期,一个音量和另外一列。在一个DF上,该列是股票,另一个是位置。
> df
# A tibble: 54,885 × 3
Position Date Volume
<dbl> <date> <dbl>
1 2 2003-01-02 1809
2 1 2003-01-03 1831
3 1 2003-01-06 2465
4 1 2003-01-07 1215
5 1 2003-01-08 955
6 1 2003-01-09 1192
7 1 2003-01-10 1901
8 1 2003-01-13 2110
9 1 2003-01-14 2521
10 1 2003-01-15 1704
# … with 54,875 more rows
> df2
# A tibble: 154 × 3
ticker date volume
<chr> <date> <dbl>
1 X29L2 2022-06-01 34015836
2 X29L2 2022-06-02 255554864
3 X29L2 2022-06-03 52779492
4 X29L2 2022-06-06 846971456
5 X29L2 2022-06-07 433462592
6 X29L2 2022-06-08 31365.
7 X29L2 2022-06-09 8364060.
8 X29L2 2022-06-10 7550020
9 X29L2 2022-06-13 39714244
10 X29L2 2022-06-14 821900608
# … with 144 more rows
我想要每个位置,以及每个股票的最后日期。
让我们从DF2开始:
> df2 %>%
+ group_by(ticker) %>%
+ do(tail(., n=1))
# A tibble: 7 × 3
# Groups: ticker [7]
ticker date volume
<chr> <date> <dbl>
1 X16D2 2022-07-04 115893395.
2 X16G2 2022-07-04 434399604.
3 X17F3 2022-07-04 206883540.
4 X19Y3 2022-07-04 317255104.
5 X20E3 2022-07-04 291394381.
6 X21O2 2022-07-04 186407123.
7 X29L2 2022-07-04 69635266.
效果很好。日期的范围是
> range(df2$date)
[1] "2022-06-01" "2022-07-04"
,这是我期望的正确输出。
让我们对DF进行相同的操作:
> df %>%
+ group_by(Position) %>%
+ do(tail(., n=1))
# A tibble: 20 × 3
# Groups: Position [20]
Position Date Volume
<dbl> <date> <dbl>
1 1 2021-12-30 163917
2 2 2021-11-30 969631
3 3 2021-10-29 153590
4 4 2021-09-30 97777
5 5 2021-08-31 188115
6 6 2022-07-01 5277
7 7 2022-06-30 24808
8 8 2022-05-31 28236
9 9 2022-04-29 1499
10 10 2022-03-31 197
11 11 2022-02-25 500
12 12 2022-01-31 NA
13 13 2021-12-30 NA
14 14 2015-09-30 NA
15 15 2010-10-29 NA
16 16 2010-09-30 NA
17 17 2010-08-31 NA
18 18 2010-07-30 NA
19 19 2010-03-31 NA
20 20 2009-12-30 NA
位置的范围是:
> range(df$Position)
[1] 1 20
但是如图所示,它可以检索不同的日期,而不是最后一个。 这里的日期范围是:
> range(df$Date)
[1] "2003-01-02" "2022-07-01"
肯定有一个位置1的数据,最多有12个,如下所示:
> df %>% filter(Date == '2022-07-01')
# A tibble: 12 × 3
Position Date Volume
<dbl> <date> <dbl>
1 7 2022-07-01 2052
2 8 2022-07-01 2644
3 9 2022-07-01 357
4 10 2022-07-01 260
5 11 2022-07-01 491
6 12 2022-07-01 100
7 1 2022-07-01 525635
8 2 2022-07-01 107201
9 3 2022-07-01 39664
10 4 2022-07-01 12479
11 5 2022-07-01 12568
12 6 2022-07-01 5277
有人可以帮助我理解它们的工作方式有所不同,以及我该怎么做才能获得与DF2相同的结果?
谢谢!
I have 2 DFs that contains basically a Date, a Volume, and one more column. On one df that column is ticker, and on the other is Position.
> df
# A tibble: 54,885 × 3
Position Date Volume
<dbl> <date> <dbl>
1 2 2003-01-02 1809
2 1 2003-01-03 1831
3 1 2003-01-06 2465
4 1 2003-01-07 1215
5 1 2003-01-08 955
6 1 2003-01-09 1192
7 1 2003-01-10 1901
8 1 2003-01-13 2110
9 1 2003-01-14 2521
10 1 2003-01-15 1704
# … with 54,875 more rows
> df2
# A tibble: 154 × 3
ticker date volume
<chr> <date> <dbl>
1 X29L2 2022-06-01 34015836
2 X29L2 2022-06-02 255554864
3 X29L2 2022-06-03 52779492
4 X29L2 2022-06-06 846971456
5 X29L2 2022-06-07 433462592
6 X29L2 2022-06-08 31365.
7 X29L2 2022-06-09 8364060.
8 X29L2 2022-06-10 7550020
9 X29L2 2022-06-13 39714244
10 X29L2 2022-06-14 821900608
# … with 144 more rows
I want for each Position, and for each ticker, the last Dates available.
Let's start with df2:
> df2 %>%
+ group_by(ticker) %>%
+ do(tail(., n=1))
# A tibble: 7 × 3
# Groups: ticker [7]
ticker date volume
<chr> <date> <dbl>
1 X16D2 2022-07-04 115893395.
2 X16G2 2022-07-04 434399604.
3 X17F3 2022-07-04 206883540.
4 X19Y3 2022-07-04 317255104.
5 X20E3 2022-07-04 291394381.
6 X21O2 2022-07-04 186407123.
7 X29L2 2022-07-04 69635266.
That works great. The range of Dates is
> range(df2$date)
[1] "2022-06-01" "2022-07-04"
and this is the correct output I was expecting.
Let's do the same with df:
> df %>%
+ group_by(Position) %>%
+ do(tail(., n=1))
# A tibble: 20 × 3
# Groups: Position [20]
Position Date Volume
<dbl> <date> <dbl>
1 1 2021-12-30 163917
2 2 2021-11-30 969631
3 3 2021-10-29 153590
4 4 2021-09-30 97777
5 5 2021-08-31 188115
6 6 2022-07-01 5277
7 7 2022-06-30 24808
8 8 2022-05-31 28236
9 9 2022-04-29 1499
10 10 2022-03-31 197
11 11 2022-02-25 500
12 12 2022-01-31 NA
13 13 2021-12-30 NA
14 14 2015-09-30 NA
15 15 2010-10-29 NA
16 16 2010-09-30 NA
17 17 2010-08-31 NA
18 18 2010-07-30 NA
19 19 2010-03-31 NA
20 20 2009-12-30 NA
The range of Position is:
> range(df$Position)
[1] 1 20
But as it is shown, it retrieves different Dates and they're not the last ones.
the range of Dates here is:
> range(df$Date)
[1] "2003-01-02" "2022-07-01"
And there is certainly data for Position 1 up to 12 as shown below:
> df %>% filter(Date == '2022-07-01')
# A tibble: 12 × 3
Position Date Volume
<dbl> <date> <dbl>
1 7 2022-07-01 2052
2 8 2022-07-01 2644
3 9 2022-07-01 357
4 10 2022-07-01 260
5 11 2022-07-01 491
6 12 2022-07-01 100
7 1 2022-07-01 525635
8 2 2022-07-01 107201
9 3 2022-07-01 39664
10 4 2022-07-01 12479
11 5 2022-07-01 12568
12 6 2022-07-01 5277
Anyone can help me understand why they worked so differently and what can I do to get the same result as in df2?
Thanks!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
在调用
tail
do 中,可能是不安排
的情况。我们要么安排
按'位置','日期'的数据,然后使用slice_tail
by'tocare'进行组,或者直接使用
slice_max
没有安排
in在
desc
结束订单中的之后>返回第一个唯一值此方法的优势是它不需要分组,然后
ungroup
(删除组属性)It may be a case of not
arrange
ing the data before calling thetail
withindo
. Either wearrange
the data by 'Position', 'Date' and then do a group by 'Position' withslice_tail
Or directly use
slice_max
withoutarrange
ingOr may also use
distinct
afterarrange
ing indesc
ending order asdistinct
returns the first unique valueThe advantage with this method is it doesn't require grouping and then
ungroup
(to remove the group attribute)