Split a data frame by number of rows
I have a dataframe made up of 400'000 rows and about 50 columns. As this dataframe is so large, it is too computationally taxing to work with.
I would like to split this dataframe up into smaller ones, after which I will run the functions I would like to run, and then reassemble the dataframe at the end.
There is no grouping variable that I would like to use to split up this dataframe. I would just like to split it up by number of rows. For example, I would like to split this 400'000-row table into 400 1'000-row dataframes.
How might I do this?
Comments (2)
Make your own grouping variable.
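A minimal base R sketch of this idea, assuming the row count is an exact multiple of the chunk size. The names `my_data_frame`, `chunk`, and `process_chunk` are illustrative and not from the original post, and a 12-row toy frame stands in for the 400'000-row table.

```r
# Toy stand-in for the large data frame (hypothetical data).
my_data_frame <- data.frame(id = 1:12, value = (1:12) * 10)

chunk <- 4                                 # rows per piece (1000 in the question)
n_chunks <- nrow(my_data_frame) / chunk    # assumes an exact multiple

# Grouping variable: 1,1,1,1,2,2,2,2,3,3,3,3
grp <- rep(seq_len(n_chunks), each = chunk)

# split() returns a list of data frames, one per group.
pieces <- split(my_data_frame, grp)

# Hypothetical per-chunk function; replace with the real work.
process_chunk <- function(d) {
  d$value2 <- d$value * 2
  d
}

# Process each piece, then reassemble in the original row order.
result <- do.call(rbind, lapply(pieces, process_chunk))
rownames(result) <- NULL
```

With 400'000 rows, `chunk <- 1000` would give the 400 pieces asked for in the question.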
You should also consider the `ddply` function from the `plyr` package, or the `group_by()` function from `dplyr`.
Edited for brevity, after Hadley's comments.
If you don't know how many rows are in the data frame, or if the number of rows might not be an exact multiple of your desired chunk size, you can build the grouping variable from the row count and truncate it to the number of rows.
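As a sketch of that case in base R (the toy data and the names `chunk`, `n`, and `grp` are assumptions for illustration): generate enough full-length group labels to cover every row, then cut the label vector down to `n`, so the last piece is simply shorter.

```r
# Toy data: 10 rows do not divide evenly into chunks of 4.
my_data_frame <- data.frame(id = 1:10, value = (1:10)^2)

chunk <- 4
n <- nrow(my_data_frame)

# ceiling(n / chunk) groups, each label repeated `chunk` times,
# truncated to exactly n labels.
grp <- rep(seq_len(ceiling(n / chunk)), each = chunk)[seq_len(n)]

pieces <- split(my_data_frame, grp)

# Number of rows in each piece.
piece_sizes <- sapply(pieces, nrow)
```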
You could also use
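One option that fits here, offered as an assumption rather than as the snippet the answer originally showed, is integer division on the row index. It gives the same grouping without `rep()`; the variable names are again illustrative.

```r
# Toy data (hypothetical).
my_data_frame <- data.frame(id = 1:10, value = (1:10)^2)

chunk <- 4
n <- nrow(my_data_frame)

# Rows 1-4 map to group 1, rows 5-8 to group 2, rows 9-10 to group 3.
grp <- (seq_len(n) - 1) %/% chunk + 1

pieces <- split(my_data_frame, grp)
```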
For future readers, methods based on the `dplyr` and `data.table` packages will probably be (much) faster for doing group-wise operations on data frames.
There are also many answers here.
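A sketch of the `dplyr`-based approach mentioned above, assuming the `dplyr` package is installed; `data.table` is not shown. The column name `index`, the toy data, and the `summarise()` step are placeholders for whatever per-chunk work is needed.

```r
library(dplyr)

# Toy data (hypothetical).
my_data_frame <- data.frame(id = 1:10, value = (1:10) * 1.5)
chunk <- 4

result <- my_data_frame %>%
  # Label each row with its chunk number: 1,1,1,1,2,2,2,2,3,3
  mutate(index = (row_number() - 1) %/% chunk + 1) %>%
  group_by(index) %>%
  # Placeholder group-wise operation: one summary row per chunk.
  summarise(n_rows = n(), total = sum(value))
```

Here the data never has to be physically split and reassembled; the grouping column does that work inside a single pipeline.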
I had a similar question and used this:
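A minimal sketch consistent with the walkthrough that follows, assuming the `dplyr` package; the toy `df`, the value of `n`, and the names `grp` and `split_blocks` are illustrative. The first pipeline takes the description literally (remainder of the row number modulo `n`, giving `n` interleaved groups). The second uses integer division instead, an alternative not stated in the post, which gives contiguous blocks of `n` rows.

```r
library(dplyr)

# Toy input data frame (hypothetical).
df <- data.frame(id = 1:10, value = letters[1:10])
n <- 3

# Modulo grouping: n groups, rows assigned round-robin.
split <- df %>%
  group_by(grp = row_number() %% n) %>%
  group_map(~ .x)          # returns a list with one tibble per group

# Integer-division grouping: contiguous blocks of n rows each.
split_blocks <- df %>%
  group_by(grp = (row_number() - 1) %/% n) %>%
  group_map(~ .x)
```

Which variant is appropriate depends on whether the pieces need to keep neighbouring rows together.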
From left to right: the result is assigned to `split`; `df` is your input dataframe; the rows are grouped by dividing `row_number` by `n` (number of groups) using modular division; and the groups are passed to the `group_map` function, which returns a list.
So in the end your `split` is a list with a group of your dataset in each element.
On the other hand, you could also write your data out immediately by replacing the `group_map` call with e.g. `group_walk(~ write_csv(.x, paste0("file_", .y, ".csv")))`.
You can find more info on these powerful tools in the dplyr cheat sheet, which explains `group_by`, and in the documentation for the follow-up functions `group_map` and `group_walk`.
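The `group_walk` variant mentioned above might look like the sketch below. It assumes `dplyr` and swaps in base R's `write.csv` for `readr::write_csv` to avoid an extra dependency; the output directory `out_dir`, the grouping column `grp`, and the toy data are hypothetical.

```r
library(dplyr)

# Toy input data frame (hypothetical).
df <- data.frame(id = 1:10, value = letters[1:10])
n <- 3
out_dir <- tempdir()   # write into a temporary directory for this example

df %>%
  group_by(grp = row_number() %% n) %>%
  # .x is the data for one group, .y is a one-row tibble holding its key.
  group_walk(~ write.csv(
    .x,
    file.path(out_dir, paste0("file_", .y$grp, ".csv")),
    row.names = FALSE
  ))

# Files written: file_0.csv, file_1.csv, file_2.csv
written <- file.path(out_dir, paste0("file_", 0:2, ".csv"))
```

`group_walk` is called for its side effect, so nothing needs to be held in memory as a list.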