Using tail() within groups confuses me a bit

Posted 2025-02-12 17:59:22 · 3236 characters · 0 views · 0 comments


I have two data frames that each contain a Date, a Volume, and one more column. In one data frame that column is ticker; in the other it is Position.

> df
# A tibble: 54,885 × 3
   Position Date       Volume
      <dbl> <date>      <dbl>
 1        2 2003-01-02   1809
 2        1 2003-01-03   1831
 3        1 2003-01-06   2465
 4        1 2003-01-07   1215
 5        1 2003-01-08    955
 6        1 2003-01-09   1192
 7        1 2003-01-10   1901
 8        1 2003-01-13   2110
 9        1 2003-01-14   2521
10        1 2003-01-15   1704
# … with 54,875 more rows

> df2
# A tibble: 154 × 3
   ticker date           volume
   <chr>  <date>          <dbl>
 1 X29L2  2022-06-01  34015836 
 2 X29L2  2022-06-02 255554864 
 3 X29L2  2022-06-03  52779492 
 4 X29L2  2022-06-06 846971456 
 5 X29L2  2022-06-07 433462592 
 6 X29L2  2022-06-08     31365.
 7 X29L2  2022-06-09   8364060.
 8 X29L2  2022-06-10   7550020 
 9 X29L2  2022-06-13  39714244 
10 X29L2  2022-06-14 821900608 
# … with 144 more rows

For each Position, and for each ticker, I want the row with the last available Date.

Let's start with df2:

> df2 %>% 
+   group_by(ticker) %>%
+   do(tail(., n=1))
# A tibble: 7 × 3
# Groups:   ticker [7]
  ticker date           volume
  <chr>  <date>          <dbl>
1 X16D2  2022-07-04 115893395.
2 X16G2  2022-07-04 434399604.
3 X17F3  2022-07-04 206883540.
4 X19Y3  2022-07-04 317255104.
5 X20E3  2022-07-04 291394381.
6 X21O2  2022-07-04 186407123.
7 X29L2  2022-07-04  69635266.

That works great. The range of Dates is

> range(df2$date)
[1] "2022-06-01" "2022-07-04"

and this is the correct output I was expecting.

Let's do the same with df:

> df %>%   
+   group_by(Position) %>%
+   do(tail(., n=1))
# A tibble: 20 × 3
# Groups:   Position [20]
   Position Date       Volume
      <dbl> <date>      <dbl>
 1        1 2021-12-30 163917
 2        2 2021-11-30 969631
 3        3 2021-10-29 153590
 4        4 2021-09-30  97777
 5        5 2021-08-31 188115
 6        6 2022-07-01   5277
 7        7 2022-06-30  24808
 8        8 2022-05-31  28236
 9        9 2022-04-29   1499
10       10 2022-03-31    197
11       11 2022-02-25    500
12       12 2022-01-31     NA
13       13 2021-12-30     NA
14       14 2015-09-30     NA
15       15 2010-10-29     NA
16       16 2010-09-30     NA
17       17 2010-08-31     NA
18       18 2010-07-30     NA
19       19 2010-03-31     NA
20       20 2009-12-30     NA

The range of Position is:

> range(df$Position)
[1]  1 20

But as shown, it retrieves a different Date for each Position, and they are not the last ones. The range of Dates here is:

> range(df$Date)
[1] "2003-01-02" "2022-07-01"

And there is certainly data on the last Date for Positions 1 through 12, as shown below:

> df %>% filter(Date == '2022-07-01')
# A tibble: 12 × 3
   Position Date       Volume
      <dbl> <date>      <dbl>
 1        7 2022-07-01   2052
 2        8 2022-07-01   2644
 3        9 2022-07-01    357
 4       10 2022-07-01    260
 5       11 2022-07-01    491
 6       12 2022-07-01    100
 7        1 2022-07-01 525635
 8        2 2022-07-01 107201
 9        3 2022-07-01  39664
10        4 2022-07-01  12479
11        5 2022-07-01  12568
12        6 2022-07-01   5277

Can anyone help me understand why the two behave so differently, and what I can do to get the same result as with df2?

Thanks!


提笔落墨 2025-02-19 17:59:22


It may be a case of not arranging the data before calling tail within do: tail returns the last row of each group in its stored order, not the row with the latest Date. Either we arrange the data by 'Position' and 'Date' and then group by 'Position' and use slice_tail

library(dplyr)
df %>% 
   arrange(Position, Date) %>%
   group_by(Position) %>% 
   slice_tail(n = 1) %>%
   # or may use do, but it can be slow
   # do(tail(., n = 1))
   ungroup()
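
To see why the ordering matters, here is a minimal sketch on made-up data (the `toy` tibble and its values are hypothetical, not taken from the question's df). It assumes dplyr 1.0.0 or later for slice_tail. Position 1's rows are deliberately stored out of date order, which is what df appears to look like.

```r
library(dplyr)

# Hypothetical toy data: Position 1's rows are NOT stored in date order
toy <- tibble(
  Position = c(1, 1, 2, 2),
  Date     = as.Date(c("2022-07-01", "2021-12-30", "2021-11-30", "2022-07-01")),
  Volume   = c(525635, 163917, 969631, 107201)
)

# Last stored row per group: for Position 1 this is 2021-12-30,
# which is not its latest date
last_stored <- toy %>%
  group_by(Position) %>%
  slice_tail(n = 1) %>%
  ungroup()

# Sorting by Date within Position first makes the last row the latest date
last_sorted <- toy %>%
  arrange(Position, Date) %>%
  group_by(Position) %>%
  slice_tail(n = 1) %>%
  ungroup()
```

If df2 happens to be stored in date order within each ticker, that would explain why the same do(tail(., n = 1)) call looked correct there.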

Or directly use slice_max without arranging

df %>%
   group_by(Position) %>%
   slice_max(order_by = Date, n = 1, with_ties = FALSE) %>%
   ungroup()
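
The with_ties = FALSE argument only matters when a group has more than one row on its latest Date. A small sketch on hypothetical data (the `toy` tibble is invented for illustration) shows the difference:

```r
library(dplyr)

# Hypothetical toy data: Position 1 has two rows on its latest date
toy <- tibble(
  Position = c(1, 1, 1, 2),
  Date     = as.Date(c("2022-06-30", "2022-07-01", "2022-07-01", "2022-07-01")),
  Volume   = c(10, 20, 30, 40)
)

# Default (with_ties = TRUE): every row tied for the maximum Date is kept
ties_kept <- toy %>%
  group_by(Position) %>%
  slice_max(order_by = Date, n = 1) %>%
  ungroup()

# with_ties = FALSE: exactly one row per group
one_per_group <- toy %>%
  group_by(Position) %>%
  slice_max(order_by = Date, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Here ties_kept has three rows and one_per_group has two, so with_ties = FALSE is the safer choice when exactly one row per Position is wanted.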

Or use distinct after arranging in descending order of Date, since distinct keeps the first row for each unique value

df %>%
   arrange(Position, desc(Date)) %>%
   distinct(Position, .keep_all = TRUE)

The advantage of this method is that it doesn't require grouping and then ungrouping (to remove the group attribute).
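
To confirm whether ordering really is the cause, one possible check is to compare the last stored Date with the maximum Date per group. This is only a sketch on hypothetical data (the `toy` tibble and the summary column names are made up); on the real data the same summarise call would be applied to df grouped by Position.

```r
library(dplyr)

# Hypothetical toy data: Position 1 is stored out of date order, Position 2 is not
toy <- tibble(
  Position = c(2, 1, 1, 2),
  Date     = as.Date(c("2022-06-30", "2022-07-01", "2021-12-30", "2022-07-01"))
)

order_check <- toy %>%
  group_by(Position) %>%
  summarise(
    is_sorted   = all(diff(Date) >= 0),  # TRUE if dates never go backwards
    last_stored = last(Date),            # what tail(., n = 1) would pick
    latest      = max(Date)              # what was actually wanted
  )
```

Any group where is_sorted is FALSE, or where last_stored differs from latest, is one where tail without a prior arrange returns the wrong row.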
