Emulate split() with dplyr group_by: return a list of data frames
I have a large dataset that chokes `split()` in R. I am able to use dplyr `group_by` (which is the preferred way anyway), but I am unable to persist the resulting `grouped_df` as a list of data frames, a format required by my subsequent processing steps (I need to coerce to `SpatialDataFrames` and similar).
Consider a sample dataset:
df = as.data.frame(cbind(c("a","a","b","b","c"),c(1,2,3,4,5), c(2,3,4,2,2)))
listDf = split(df,df$V1)
returns
$a
V1 V2 V3
1 a 1 2
2 a 2 3
$b
V1 V2 V3
3 b 3 4
4 b 4 2
$c
V1 V2 V3
5 c 5 2
I would like to emulate this with `group_by` (something like `group_by(df, V1)`), but this returns a single `grouped_df`. I know that `do` should be able to help me, but I am unsure about usage (also see link for a discussion).
Note that `split` names each list element after the factor level that was used to establish the group. This is desired behavior (ultimately, bonus kudos for a way to extract these names from the list of data frames).
group_split in dplyr:
dplyr has implemented `group_split`: https://dplyr.tidyverse.org/reference/group_split.html
It splits a data frame by groups and returns a list of data frames. Each of these data frames is a subset of the original data frame, defined by the categories of the splitting variable.
For example, you can split the dataset `iris` by the variable `Species` and calculate summaries of each sub-dataset.
It is also very helpful for debugging calculations on nested data frames, because it is a quick way to "see" what is going on "inside" those calculations.
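As a minimal sketch of the `iris` example described above, assuming dplyr >= 0.8 and purrr are installed: the choice of summary (mean sepal length) and the variable names `iris_list` and `mean_sepal` are illustrative assumptions, not part of the original answer.

```r
library(dplyr)
library(purrr)

# Split iris into one data frame per species (needs dplyr >= 0.8).
iris_list <- iris %>% group_split(Species)

# group_split() returns an unnamed list, so take the labels from the group keys.
species <- iris %>%
  group_by(Species) %>%
  group_keys() %>%
  pull(Species) %>%
  as.character()

# Summarise each sub-dataset; here, the mean sepal length per species.
mean_sepal <- map_dbl(iris_list, ~ mean(.x$Sepal.Length))
names(mean_sepal) <- species
print(mean_sepal)
```

Inspecting a single element, for instance `iris_list[[1]]`, is the quick way to "see" one group in isolation.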
Comparing the base, `plyr` and `dplyr` solutions, it still seems the base one is much faster!
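A comparison along these lines could be set up roughly as follows. This is only a sketch: the test data `big`, the use of `system.time()`, and the choice of `group_split()` as the dplyr variant are assumptions, and timings will differ between machines and package versions, so no result is claimed here.

```r
library(dplyr)

set.seed(1)
# Hypothetical test data: 26 groups with 1000 rows each.
big <- data.frame(Group = rep(LETTERS, each = 1000),
                  Value = rnorm(26 * 1000))

# Base R.
t_base <- system.time(res_base <- split(big, big$Group))

# plyr, called through its namespace so it does not mask dplyr functions.
t_plyr <- system.time(res_plyr <- plyr::dlply(big, "Group", identity))

# dplyr (assumes dplyr >= 0.8 for group_split()).
t_dplyr <- system.time(res_dplyr <- big %>% group_by(Group) %>% group_split())

# Elapsed seconds for each approach.
print(c(base  = t_base[["elapsed"]],
        plyr  = t_plyr[["elapsed"]],
        dplyr = t_dplyr[["elapsed"]]))
```

For more reliable numbers, repeated runs with a dedicated benchmarking package would be preferable to a single `system.time()` call.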
To 'stick' to dplyr, you can also use `plyr` instead of `split`.
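One way this might look, assuming the plyr package is installed and using `plyr::dlply()` with `identity` as the per-group function (the answer does not say which plyr function it had in mind, so this is an assumption). The sample data mirrors the question's `df`, but with numeric columns for simplicity.

```r
# Sample data modelled on the question (numeric V2/V3 are an assumption).
df <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                 V2 = 1:5,
                 V3 = c(2, 3, 4, 2, 2))

# dlply() splits a data frame by a variable, applies a function to each piece
# and returns a list; identity() hands each piece back unchanged.
listDf <- plyr::dlply(df, "V1", identity)

# The list elements are named after the group labels.
print(names(listDf))
```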
You can get a list of data frames from `group_by` using `do`, as long as you name the new column where the data frames will be stored and then pipe that column into `lapply`.
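A sketch of that approach, assuming a dplyr version in which `do()` is still available (it is superseded in dplyr 1.0 but still works). The column name `vals` and the naming step at the end are illustrative assumptions.

```r
library(dplyr)

# Sample data modelled on the question (numeric V2/V3 are an assumption).
df <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                 V2 = 1:5,
                 V3 = c(2, 3, 4, 2, 2))

# do() with a named argument stores one data frame per group in a list column.
nested <- df %>%
  group_by(V1) %>%
  do(vals = data.frame(.))

# Pipe the list column into lapply() to obtain a plain list of data frames.
listDf <- nested$vals %>% lapply(function(x) x)

# Optionally name the elements after their groups, as split() would.
names(listDf) <- nested$V1
```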
Since dplyr 0.8 you can use `group_split`.
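Applied to the question's sample data, this could look like the sketch below. Because `group_split()` does not name its output, the names are recovered with `group_keys()`; that naming step is an addition that the answer itself does not mention.

```r
library(dplyr)

# Sample data modelled on the question (numeric V2/V3 are an assumption).
df <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                 V2 = 1:5,
                 V3 = c(2, 3, 4, 2, 2))

grouped <- df %>% group_by(V1)

# One data frame per group, in the order of the group keys.
listDf <- grouped %>% group_split()

# Recover the group labels and use them as list names.
names(listDf) <- group_keys(grouped)$V1
print(names(listDf))
```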
Since `dplyr 0.5.0.9000`, the shortest solution that uses `group_by()` is probably to follow `do` with a `pull`.
Note that, unlike `split`, this doesn't name the resulting list elements. If names are desired, they have to be attached in an extra step.
To editorialize a little, I agree with the folks saying that `split()` is the better option. Personally, I always found it annoying that I have to type the name of the data frame twice (e.g., `split(potentiallylongname, potentiallylongname$V1)`), but the issue is easily sidestepped with the pipe.
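The three variants mentioned in this answer can be sketched as follows. The column name `data`, the use of `setNames()` for the naming step, and the sample data are assumptions made for illustration.

```r
library(dplyr)

# Sample data modelled on the question (numeric V2/V3 are an assumption).
df <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                 V2 = 1:5,
                 V3 = c(2, 3, 4, 2, 2))

# 1. do() followed by pull(): a list of data frames without group names.
unnamed <- df %>%
  group_by(V1) %>%
  do(data = data.frame(.)) %>%
  pull(data)

# 2. Keep the nested result and attach the group labels as names.
nested <- df %>%
  group_by(V1) %>%
  do(data = data.frame(.))
named <- setNames(nested$data, nested$V1)

# 3. split() with the pipe: the data frame name is typed only once.
listDf <- df %>% split(.$V1)
```

The third form relies on magrittr still inserting the left-hand side as the first argument when the dot only appears inside a nested expression such as `.$V1`.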
Another option is to use `group_map` after `group_by`. If you want to keep the group names on the resulting list of data frames, you could use `set_names` from `purrr`.
Created on 2023-03-04 with reprex v2.0.2
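A minimal sketch of the `group_map` approach, assuming a recent dplyr (where the argument is spelled `.keep`) and purrr. Taking the names from `group_keys()` is an assumption about how the labels are obtained, and this is not the output of the reprex referenced above.

```r
library(dplyr)
library(purrr)

# Sample data modelled on the question (numeric V2/V3 are an assumption).
df <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                 V2 = 1:5,
                 V3 = c(2, 3, 4, 2, 2))

grouped <- df %>% group_by(V1)

# group_map() applies a function to each group and returns a list;
# .keep = TRUE retains the grouping column in every piece.
listDf <- grouped %>%
  group_map(~ .x, .keep = TRUE) %>%
  set_names(group_keys(grouped)$V1)

print(names(listDf))
```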