Emulate split() with dplyr group_by: return a list of data frames

Posted on 2025-02-13 11:50:15

I have a large dataset that chokes split() in R. I am able to use dplyr group_by (which is the preferred way anyway), but I am unable to persist the resulting grouped_df as a list of data frames, a format required by my subsequent processing steps (I need to coerce to SpatialDataFrames and similar).

Consider a sample dataset:

df = as.data.frame(cbind(c("a","a","b","b","c"),c(1,2,3,4,5), c(2,3,4,2,2)))
listDf = split(df,df$V1)

This returns a named list of data frames, one per level of V1 (here a, b and c).

I would like to emulate this with group_by (something like group_by(df, V1)), but this returns a single grouped_df. I know that do should be able to help me, but I am unsure about its usage (also see link for a discussion).

Note that split names each list element after the level of the factor that was used to establish the group. This is a desired feature (ultimately, bonus kudos for a way to extract these names from the list of dfs).

夜血缘 2025-02-20 11:50:15

group_split in dplyr:

dplyr has implemented group_split:
https://dplyr.tidyverse.org/reference/group_split.html

It splits a data frame by groups and returns a list of data frames. Each of these data frames is a subset of the original data frame, defined by the categories of the splitting variable.

For example, split the dataset iris by the variable Species and calculate summaries of each sub-dataset:

> library(dplyr)
> library(purrr)  # provides map()
> iris %>%
+     group_split(Species) %>%
+     map(summary)
[[1]]
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0  
 Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0  
 Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246                  
 3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300                  
 Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600                  

[[2]]
  Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
 Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
 1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
 Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
 Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
 3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
 Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  

[[3]]
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0  
 1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0  
 Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50  
 Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026                  
 3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300                  
 Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500

It is also very helpful for debugging calculations on nested data frames, because it is a quick way to "see" what is going on "inside" those calculations.
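
As the output above shows, group_split() leaves the list unnamed ([[1]], [[2]], [[3]]). Here is a minimal sketch of one way to recover split()-style names, assuming a dplyr version that provides group_keys() (0.8 or later). The variable names grouped and pieces are only illustrative:

```r
library(dplyr)

# Group first so the same grouped object can feed both calls below.
grouped <- group_by(iris, Species)

# group_split() returns an unnamed list with one tibble per group.
pieces <- group_split(grouped)

# group_keys() returns one row per group, in the same order as the pieces,
# so its Species column can serve as the list names.
pieces <- setNames(pieces, as.character(group_keys(grouped)$Species))

names(pieces)
sapply(pieces, nrow)
```

With names attached, a single group can be pulled out by name (for example pieces$setosa) rather than by position.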

隱形的亼 2025-02-20 11:50:15

Comparing the base, plyr and dplyr solutions, it still seems the base one is much faster!

library(plyr)
library(dplyr)
library(microbenchmark)  # provides microbenchmark()

df <- data_frame(Group1=rep(LETTERS, each=1000),
             Group2=rep(rep(1:10, each=100),26), 
             Value=rnorm(26*1000))

microbenchmark(Base=df %>%
             split(list(.$Group2, .$Group1)),
           dplyr=df %>% 
             group_by(Group1, Group2) %>% 
             do(data = (.)) %>% 
             select(data) %>% 
             lapply(function(x) {(x)}) %>% .[[1]],
           plyr=dlply(df, c("Group1", "Group2"), as.tbl),
           times=50)

Gives:

Unit: milliseconds
 expr      min        lq      mean    median        uq       max neval
 Base 12.82725  13.38087  16.21106  14.58810  17.14028  41.67266    50
dplyr 25.59038  26.66425  29.40503  27.37226  28.85828  77.16062    50
 plyr 99.52911 102.76313 110.18234 106.82786 112.69298 140.97568    50
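
Before reading too much into timings like these, it can help to confirm that the candidates being compared produce the same partition. This is a small sketch, assuming the same data shape as the benchmark above. It uses tibble() and group_split() (dplyr 0.8 or later), neither of which is part of the original benchmark, and it makes no timing claims:

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- tibble(Group1 = rep(LETTERS, each = 1000),
             Group2 = rep(rep(1:10, each = 100), 26),
             Value  = rnorm(26 * 1000))

# Base R: split on the interaction of the two grouping columns.
base_pieces <- split(df, list(df$Group2, df$Group1))

# dplyr: group on both columns, then split into a list of tibbles.
dplyr_pieces <- df %>% group_by(Group1, Group2) %>% group_split()

# Both should yield one piece per Group1/Group2 combination.
length(base_pieces)
length(dplyr_pieces)
all(sapply(base_pieces, nrow) == 100)
all(sapply(dplyr_pieces, nrow) == 100)
```

The two lists are not ordered the same way (split() varies Group2 fastest within each Group1 only because of the order given in list()), so compare pieces by key rather than by position.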

横笛休吹塞上声 2025-02-20 11:50:15

To 'stick' to dplyr, you can also use plyr instead of split:

library(plyr)

dlply(df, "V1", identity)
#$a
#  V1 V2 V3
#1  a  1  2
#2  a  2  3

#$b
#  V1 V2 V3
#1  b  3  4
#2  b  4  2

#$c
#  V1 V2 V3
#1  c  5  2

以为你会在 2025-02-20 11:50:15

You can get a list of data frames from group_by using do as long as you name the new column where the data frames will be stored and then pipe that column into lapply.

listDf = df %>% group_by(V1) %>% do(vals=data.frame(.)) %>% select(vals) %>% lapply(function(x) {(x)})
listDf[[1]]
#[[1]]
#  V1 V2 V3
#1  a  1  2
#2  a  2  3

#[[2]]
#  V1 V2 V3
#1  b  3  4
#2  b  4  2

#[[3]]
#  V1 V2 V3
#1  c  5  2

欲拥i 2025-02-20 11:50:15

Since dplyr 0.8, you can use group_split:

library(dplyr)
df = as.data.frame(cbind(c("a","a","b","b","c"),c(1,2,3,4,5), c(2,3,4,2,2)))
df %>% group_by(V1) %>% group_split()
#> [[1]]
#> # A tibble: 2 x 3
#>   V1    V2    V3   
#>   <fct> <fct> <fct>
#> 1 a     1     2    
#> 2 a     2     3    
#> 
#> [[2]]
#> # A tibble: 2 x 3
#>   V1    V2    V3   
#>   <fct> <fct> <fct>
#> 1 b     3     4    
#> 2 b     4     2    
#> 
#> [[3]]
#> # A tibble: 1 x 3
#>   V1    V2    V3   
#>   <fct> <fct> <fct>
#> 1 c     5     2
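
The <fct> columns in the output above come from the sample data rather than from group_split(): cbind() on mixed inputs builds a character matrix, so V2 and V3 stop being numeric (they end up as factors or character vectors, depending on the R version). A minimal sketch, assuming you would rather keep the numeric columns numeric, builds the data frame directly:

```r
library(dplyr)

# Build the columns directly instead of going through cbind(),
# so V2 and V3 stay numeric.
df <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                 V2 = c(1, 2, 3, 4, 5),
                 V3 = c(2, 3, 4, 2, 2))

pieces <- df %>% group_by(V1) %>% group_split()

length(pieces)
sapply(pieces, function(p) class(p$V2))
```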

德意的啸 2025-02-20 11:50:15

Since dplyr 0.5.0.9000, the shortest solution that uses group_by() is probably to follow do with a pull:

df %>% group_by(V1) %>% do(data=(.)) %>% pull(data)

Note that, unlike split, this doesn't name the resulting list elements. If names are desired, then you would probably want something like the following (set_names() is available from purrr):

df %>% group_by(V1) %>% do(data = (.)) %>% with( set_names(data, V1) )

To editorialize a little, I agree with the folks saying that split() is the better option. Personally, I always found it annoying that I have to type the name of the data frame twice (e.g., split( potentiallylongname, potentiallylongname$V1 )), but the issue is easily sidestepped with the pipe:

df %>% split( .$V1 )
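
Another way to avoid repeating the data frame name is sketched below. It assumes R 4.1.0 or later, where split() on a data frame accepts a one-sided formula. The data frame here is a hypothetical stand-in for the long-named one mentioned above:

```r
# Hypothetical stand-in for a data frame with an inconveniently long name.
potentiallylongname <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                                  V2 = c(1, 2, 3, 4, 5))

# The formula ~ V1 is evaluated inside the data frame,
# so the data frame name is typed only once.
listDf <- split(potentiallylongname, ~ V1)

names(listDf)
```

As with the other split() calls, the resulting list is named after the levels of V1.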

伴梦长久 2025-02-20 11:50:15

Another option is to use group_map after group_by. If you want to keep the group names on the resulting list of data frames, you could use set_names from purrr like this:

library(dplyr)
df %>% 
  group_by(V1) %>% 
  group_map(~.x)
#> [[1]]
#> # A tibble: 2 × 2
#>   V2    V3   
#>   <chr> <chr>
#> 1 1     2    
#> 2 2     3    
#> 
#> [[2]]
#> # A tibble: 2 × 2
#>   V2    V3   
#>   <chr> <chr>
#> 1 3     4    
#> 2 4     2    
#> 
#> [[3]]
#> # A tibble: 1 × 2
#>   V2    V3   
#>   <chr> <chr>
#> 1 5     2

library(purrr)
df %>% 
  group_by(V1) %>% 
  group_map(~.x) %>% 
  set_names(unique(df$V1))
#> $a
#> # A tibble: 2 × 2
#>   V2    V3   
#>   <chr> <chr>
#> 1 1     2    
#> 2 2     3    
#> 
#> $b
#> # A tibble: 2 × 2
#>   V2    V3   
#>   <chr> <chr>
#> 1 3     4    
#> 2 4     2    
#> 
#> $c
#> # A tibble: 1 × 2
#>   V2    V3   
#>   <chr> <chr>
#> 1 5     2
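
Note that set_names(unique(df$V1)) lines up with the groups only when the values first appear in the same order in which group_by() sorts them, as they do in this sample. Below is a hedged sketch of a variant that takes the names from group_keys() instead. It uses a deliberately unsorted data frame made up for illustration, and .keep = TRUE so that the grouping column stays in each piece:

```r
library(dplyr)

# Illustrative data where the groups do not first appear in sorted order.
df <- data.frame(V1 = c("b", "b", "a", "a", "c"),
                 V2 = c(1, 2, 3, 4, 5))

grouped <- group_by(df, V1)

# .keep = TRUE keeps V1 in each piece; by default group_map() drops it.
pieces <- group_map(grouped, ~ .x, .keep = TRUE)

# Name the pieces from the group keys, which follow the same order as the pieces.
names(pieces) <- as.character(group_keys(grouped)$V1)

names(pieces)
pieces$a
```

With unique(df$V1) the first piece would have been labelled "b" even though it holds the rows for "a", which is the mismatch this variant avoids.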

Created on 2023-03-04 with reprex v2.0.2
