Find the longest stretch of consecutive rows within each group using tidyverse or other R commands

Posted on 2025-01-22 18:44:35

I am not sure the title describes my question correctly, but the idea is this:

After using group_by(), I would like to find the longest stretch of consecutive rows within each group, taking the current row order into account. In other words, a group may contain one or more discontinuities (e.g. after arrange() by some other columns). I would like to add a new column (e.g. with mutate()) that flags the rows belonging to the longest stretch of each group. Below is an example:

df <- data.frame(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 1, 1, 3, 1, 2, 2, 2),
                 order = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17))

From this, I would like to get a data frame like the following:

data.frame(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 1, 1, 3, 1, 2, 2, 2),
           order = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17),
           longest = c(T, T, T, F, F, T, T, T, T, T, F, F, F, F, T, T, T))

酷炫老祖宗 2025-01-29 18:44:35

In base R:

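# rle() gives the length and group value of each run of consecutive rows;
# a run is flagged when its length equals the longest run of its group value
df$longest <- with(rle(df$group), 
                    rep(ave(lengths, values, FUN = max) == lengths, lengths))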

df
   group order longest
1      1     1    TRUE
2      1     2    TRUE
3      1     3    TRUE
4      2     4   FALSE
5      2     5   FALSE
6      3     6    TRUE
7      3     7    TRUE
8      3     8    TRUE
9      3     9    TRUE
10     3    10    TRUE
11     1    11   FALSE
12     1    12   FALSE
13     3    13   FALSE
14     1    14   FALSE
15     2    15    TRUE
16     2    16    TRUE
17     2    17    TRUE
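
To see why the one-liner above works, it can help to unpack it into separate steps. The following is only a sketch of the same idea; the intermediate names `r`, `run_max` and `is_longest_run` are introduced here for illustration and are not part of the original answer, and `df` is rebuilt from the question's example data.

```r
# Same data as in the question (order written as 1:17 for brevity)
df <- data.frame(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 1, 1, 3, 1, 2, 2, 2),
                 order = 1:17)

r <- rle(df$group)
# r$values  : 1 2 3 1 3 1 2   (group value of each run)
# r$lengths : 3 2 5 2 1 1 3   (number of rows in each run)

# For every run, the longest run length found for that group value
run_max <- ave(r$lengths, r$values, FUN = max)
# run_max   : 3 3 5 3 5 3 3

# A run is "longest" when its own length equals its group's maximum
is_longest_run <- r$lengths == run_max

# Expand the per-run flags back to one flag per row
df$longest <- rep(is_longest_run, r$lengths)
```

Note that with this comparison every run that ties for the maximum within a group is flagged, not just the first one.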

Another base R option:

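a <- rle(df$group)
# Replace each run's value with TRUE if it is the longest run of its group
a$values <- ave(a$lengths, a$values, FUN = max) == a$lengths

# Expand the run-level flags back to one value per row
df$longest <- inverse.rle(a)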

In data.table:

library(data.table)
# N is the length of each run; it stays in the result as a helper column
setDT(df)[, N := .N, by = rleid(group)][, longest := N == max(N), by = group][]

   group order N longest
 1:     1     1 3    TRUE
 2:     1     2 3    TRUE
 3:     1     3 3    TRUE
 4:     2     4 2   FALSE
 5:     2     5 2   FALSE
 6:     3     6 5    TRUE
 7:     3     7 5    TRUE
 8:     3     8 5    TRUE
 9:     3     9 5    TRUE
10:     3    10 5    TRUE
11:     1    11 2   FALSE
12:     1    12 2   FALSE
13:     3    13 1   FALSE
14:     1    14 1   FALSE
15:     2    15 3    TRUE
16:     2    16 3    TRUE
17:     2    17 3    TRUE

莳間冲淡了誓言ζ 2025-01-29 18:44:35

We can create a run identifier for consecutive values in the group column and count the rows in each run. Then we group by group and return TRUE for the rows whose run has the greatest number of consecutive rows within that group.

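library(tidyverse)

df %>% 
  # group_weight is a run id: it increases by 1 whenever group changes
  group_by(group_weight = cumsum(c(1, diff(group) != 0))) %>% 
  # number of rows in each run
  mutate(longest = n()) %>% 
  group_by(group) %>% 
  # flag rows whose run length equals the maximum for their group
  mutate(longest = longest == max(longest)) %>% 
  ungroup() %>% 
  select(-group_weight)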

Output

   group order longest
   <dbl> <dbl> <lgl>  
 1     1     1 TRUE   
 2     1     2 TRUE   
 3     1     3 TRUE   
 4     2     4 FALSE  
 5     2     5 FALSE  
 6     3     6 TRUE   
 7     3     7 TRUE   
 8     3     8 TRUE   
 9     3     9 TRUE   
10     3    10 TRUE   
11     1    11 FALSE  
12     1    12 FALSE  
13     3    13 FALSE  
14     1    14 FALSE  
15     2    15 TRUE   
16     2    16 TRUE   
17     2    17 TRUE
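
As a possible variation on the pipeline above, the run identifier can also be built with `consecutive_id()`, which is available in dplyr 1.1.0 and later. This is a sketch under that version assumption rather than part of the original answer; the column names `run_id` and `run_len` and the result name `res` are made up for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for consecutive_id()

df <- data.frame(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 1, 1, 3, 1, 2, 2, 2),
                 order = 1:17)

res <- df %>%
  # run_id increases by 1 each time the value of group changes
  mutate(run_id = consecutive_id(group)) %>%
  # run_len holds the number of rows in each run
  add_count(run_id, name = "run_len") %>%
  group_by(group) %>%
  # flag rows whose run is as long as the longest run of their group
  mutate(longest = run_len == max(run_len)) %>%
  ungroup() %>%
  select(-run_id, -run_len)
```

Unlike `diff(group) != 0`, `consecutive_id()` also works when the grouping column is not numeric, for example a character or factor column.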

If two or more runs within a group tie for the longest and you only want the first one flagged as TRUE, you could do something like this:

df2 %>% 
  # run id, then run length, as before
  group_by(group_weight = cumsum(c(1, diff(group) != 0))) %>% 
  mutate(longest = n()) %>% 
  group_by(group) %>% 
  mutate(longest = longest == max(longest)) %>% 
  # within each group, x is the earliest run id among the runs sharing a flag
  group_by(longest, .add = TRUE) %>% 
  mutate(x = min(group_weight)) %>% 
  ungroup(longest) %>% 
  # keep TRUE only for the earliest of the tied longest runs
  mutate(longest = longest == TRUE & group_weight == x & !is.na(x)) %>% 
  ungroup() %>% 
  dplyr::select(-c(group_weight, x))

Output

   group order longest
   <dbl> <dbl> <lgl>  
 1     1     1 TRUE   
 2     1     2 TRUE   
 3     1     3 TRUE   
 4     2     4 FALSE  
 5     2     5 FALSE  
 6     3     6 TRUE   
 7     3     7 TRUE   
 8     3     8 TRUE   
 9     3     9 TRUE   
10     3    10 TRUE   
11     1    11 FALSE  
12     1    12 FALSE  
13     3    13 FALSE  
14     1    14 FALSE  
15     2    15 TRUE   
16     2    16 TRUE   
17     2    17 TRUE   
18     1    18 FALSE  
19     1    19 FALSE  
20     1    20 FALSE

Data

df2 <- structure(list(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 1, 1, 
3, 1, 2, 2, 2, 1, 1, 1), order = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)), class = "data.frame", row.names = c(NA, 
-20L))
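
For comparison, the same "first run wins" tie-breaking could be sketched in base R with `rle()`. This is an illustrative variant, not the answerer's method; it relies on `which.max()` returning the position of the first maximum, and the names `r` and `first_longest` are invented for the example.

```r
# Same data as df2 above (order written as 1:20 for brevity)
df2 <- data.frame(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 1, 1,
                            3, 1, 2, 2, 2, 1, 1, 1),
                  order = 1:20)

r <- rle(df2$group)

# Within each group value, mark only the first run that reaches the maximum
# length. ave() stores the logical result back into an integer vector (0/1).
first_longest <- ave(r$lengths, r$values,
                     FUN = function(len) seq_along(len) == which.max(len))

# Expand the per-run marks back to one logical value per row
df2$longest <- rep(first_longest == 1, r$lengths)
```

Here group 1 has two runs of length 3 (rows 1 to 3 and rows 18 to 20), and only the first of them ends up flagged.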
