在组中使用尾巴()使我有些困惑

发布于 2025-02-12 17:59:22 字数 3236 浏览 0 评论 0原文

我有2个dfs,基本上包含一个日期,一个音量和另外一列。在一个DF上,该列是股票,另一个是位置。

> df
# A tibble: 54,885 × 3
   Position Date       Volume
      <dbl> <date>      <dbl>
 1        2 2003-01-02   1809
 2        1 2003-01-03   1831
 3        1 2003-01-06   2465
 4        1 2003-01-07   1215
 5        1 2003-01-08    955
 6        1 2003-01-09   1192
 7        1 2003-01-10   1901
 8        1 2003-01-13   2110
 9        1 2003-01-14   2521
10        1 2003-01-15   1704
# … with 54,875 more rows 
> df2
# A tibble: 154 × 3
   ticker date           volume
   <chr>  <date>          <dbl>
 1 X29L2  2022-06-01  34015836 
 2 X29L2  2022-06-02 255554864 
 3 X29L2  2022-06-03  52779492 
 4 X29L2  2022-06-06 846971456 
 5 X29L2  2022-06-07 433462592 
 6 X29L2  2022-06-08     31365.
 7 X29L2  2022-06-09   8364060.
 8 X29L2  2022-06-10   7550020 
 9 X29L2  2022-06-13  39714244 
10 X29L2  2022-06-14 821900608 
# … with 144 more rows

我想要每个位置,以及每个股票的最后日期。

让我们从DF2开始:

> df2 %>% 
+   group_by(ticker) %>%
+   do(tail(., n=1))
# A tibble: 7 × 3
# Groups:   ticker [7]
  ticker date           volume
  <chr>  <date>          <dbl>
1 X16D2  2022-07-04 115893395.
2 X16G2  2022-07-04 434399604.
3 X17F3  2022-07-04 206883540.
4 X19Y3  2022-07-04 317255104.
5 X20E3  2022-07-04 291394381.
6 X21O2  2022-07-04 186407123.
7 X29L2  2022-07-04  69635266.

效果很好。日期的范围是

> range(df2$date)
[1] "2022-06-01" "2022-07-04"

,这是我期望的正确输出。

让我们对DF进行相同的操作:

> df %>%   
+   group_by(Position) %>%
+   do(tail(., n=1))
# A tibble: 20 × 3
# Groups:   Position [20]
   Position Date       Volume
      <dbl> <date>      <dbl>
 1        1 2021-12-30 163917
 2        2 2021-11-30 969631
 3        3 2021-10-29 153590
 4        4 2021-09-30  97777
 5        5 2021-08-31 188115
 6        6 2022-07-01   5277
 7        7 2022-06-30  24808
 8        8 2022-05-31  28236
 9        9 2022-04-29   1499
10       10 2022-03-31    197
11       11 2022-02-25    500
12       12 2022-01-31     NA
13       13 2021-12-30     NA
14       14 2015-09-30     NA
15       15 2010-10-29     NA
16       16 2010-09-30     NA
17       17 2010-08-31     NA
18       18 2010-07-30     NA
19       19 2010-03-31     NA
20       20 2009-12-30     NA

位置的范围是:

> range(df$Position)
[1]  1 20

但是如图所示,它可以检索不同的日期,而不是最后一个。 这里的日期范围是:

> range(df$Date)
[1] "2003-01-02" "2022-07-01"

肯定有一个位置1的数据,最多有12个,如下所示:

> df %>% filter(Date == '2022-07-01')
# A tibble: 12 × 3
   Position Date       Volume
      <dbl> <date>      <dbl>
 1        7 2022-07-01   2052
 2        8 2022-07-01   2644
 3        9 2022-07-01    357
 4       10 2022-07-01    260
 5       11 2022-07-01    491
 6       12 2022-07-01    100
 7        1 2022-07-01 525635
 8        2 2022-07-01 107201
 9        3 2022-07-01  39664
10        4 2022-07-01  12479
11        5 2022-07-01  12568
12        6 2022-07-01   5277

有人可以帮助我理解它们的工作方式有所不同,以及我该怎么做才能获得与DF2相同的结果?

谢谢!

I have 2 DFs that contains basically a Date, a Volume, and one more column. On one df that column is ticker, and on the other is Position.

> df
# A tibble: 54,885 × 3
   Position Date       Volume
      <dbl> <date>      <dbl>
 1        2 2003-01-02   1809
 2        1 2003-01-03   1831
 3        1 2003-01-06   2465
 4        1 2003-01-07   1215
 5        1 2003-01-08    955
 6        1 2003-01-09   1192
 7        1 2003-01-10   1901
 8        1 2003-01-13   2110
 9        1 2003-01-14   2521
10        1 2003-01-15   1704
# … with 54,875 more rows 
> df2
# A tibble: 154 × 3
   ticker date           volume
   <chr>  <date>          <dbl>
 1 X29L2  2022-06-01  34015836 
 2 X29L2  2022-06-02 255554864 
 3 X29L2  2022-06-03  52779492 
 4 X29L2  2022-06-06 846971456 
 5 X29L2  2022-06-07 433462592 
 6 X29L2  2022-06-08     31365.
 7 X29L2  2022-06-09   8364060.
 8 X29L2  2022-06-10   7550020 
 9 X29L2  2022-06-13  39714244 
10 X29L2  2022-06-14 821900608 
# … with 144 more rows

I want for each Position, and for each ticker, the last Dates available.

Let's start with df2:

> df2 %>% 
+   group_by(ticker) %>%
+   do(tail(., n=1))
# A tibble: 7 × 3
# Groups:   ticker [7]
  ticker date           volume
  <chr>  <date>          <dbl>
1 X16D2  2022-07-04 115893395.
2 X16G2  2022-07-04 434399604.
3 X17F3  2022-07-04 206883540.
4 X19Y3  2022-07-04 317255104.
5 X20E3  2022-07-04 291394381.
6 X21O2  2022-07-04 186407123.
7 X29L2  2022-07-04  69635266.

That works great. The range of Dates is

> range(df2$date)
[1] "2022-06-01" "2022-07-04"

and this is the correct output I was expecting.

Let's do the same with df:

> df %>%   
+   group_by(Position) %>%
+   do(tail(., n=1))
# A tibble: 20 × 3
# Groups:   Position [20]
   Position Date       Volume
      <dbl> <date>      <dbl>
 1        1 2021-12-30 163917
 2        2 2021-11-30 969631
 3        3 2021-10-29 153590
 4        4 2021-09-30  97777
 5        5 2021-08-31 188115
 6        6 2022-07-01   5277
 7        7 2022-06-30  24808
 8        8 2022-05-31  28236
 9        9 2022-04-29   1499
10       10 2022-03-31    197
11       11 2022-02-25    500
12       12 2022-01-31     NA
13       13 2021-12-30     NA
14       14 2015-09-30     NA
15       15 2010-10-29     NA
16       16 2010-09-30     NA
17       17 2010-08-31     NA
18       18 2010-07-30     NA
19       19 2010-03-31     NA
20       20 2009-12-30     NA

The range of Position is:

> range(df$Position)
[1]  1 20

But as it is shown, it retrieves different Dates and they're not the last ones.
the range of Dates here is:

> range(df$Date)
[1] "2003-01-02" "2022-07-01"

And there is certainly data for Position 1 up to 12 as shown below:

> df %>% filter(Date == '2022-07-01')
# A tibble: 12 × 3
   Position Date       Volume
      <dbl> <date>      <dbl>
 1        7 2022-07-01   2052
 2        8 2022-07-01   2644
 3        9 2022-07-01    357
 4       10 2022-07-01    260
 5       11 2022-07-01    491
 6       12 2022-07-01    100
 7        1 2022-07-01 525635
 8        2 2022-07-01 107201
 9        3 2022-07-01  39664
10        4 2022-07-01  12479
11        5 2022-07-01  12568
12        6 2022-07-01   5277

Anyone can help me understand why they worked so differently and what can I do to get the same result as in df2?

Thanks!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

提笔落墨 2025-02-19 17:59:22

在调用tail do 中,可能是不安排的情况。我们要么安排按'位置','日期'的数据,然后使用slice_tail by'tocare'进行组

library(dplyr)
df %>% 
   arrange(Position, Date) %>%
   group_by(Position) %>% 
   slice_tail(n = 1) %>%
   # or may use do, but it can be slow
   # do(tail(., n = 1))
   ungroup   

,或者直接使用slice_max没有安排 in

df %>%
   group_by(Position) %>%
   slice_max(n =1, order_by = Date, with_ties = FALSE) %>%
   ungroup

desc结束订单中的之后>返回第一个唯一

df %>%
   arrange(Position, desc(Date)) %>%
   distinct(Position, .keep_all = TRUE)

值此方法的优势是它不需要分组,然后ungroup(删除组属性)

It may be a case of not arrangeing the data before calling the tail within do. Either we arrange the data by 'Position', 'Date' and then do a group by 'Position' with slice_tail

library(dplyr)
df %>% 
   arrange(Position, Date) %>%
   group_by(Position) %>% 
   slice_tail(n = 1) %>%
   # or may use do, but it can be slow
   # do(tail(., n = 1))
   ungroup   

Or directly use slice_max without arrangeing

df %>%
   group_by(Position) %>%
   slice_max(n =1, order_by = Date, with_ties = FALSE) %>%
   ungroup

Or may also use distinct after arrangeing in descending order as distinct returns the first unique value

df %>%
   arrange(Position, desc(Date)) %>%
   distinct(Position, .keep_all = TRUE)

The advantage with this method is it doesn't require grouping and then ungroup (to remove the group attribute)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文