
Split a data frame by number of rows

Posted 2024-11-29 13:17:38

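I have a data frame with 400'000 rows and about 50 columns. Because it is so large, working on it all at once is too computationally expensive. I would like to split it into smaller data frames, run the functions I need on each piece, and then reassemble the data frame at the end.

I have no grouping variable to split on; I just want to split by number of rows. For example, I would like to split this 400'000-row table into 400 data frames of 1'000 rows each. How can I do this?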

凑诗 2024-12-06 13:17:38

Make your own grouping variable.

d <- split(my_data_frame,rep(1:400,each=1000))

You should also consider the ddply function from the plyr package, or the group_by() function from dplyr.
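
The question also asks about running a function on each piece and putting the result back together. A minimal base R sketch of that round trip follows; the toy data frame and the per-chunk function `process_chunk` are assumptions made up for illustration, not part of the original answer.

```r
# Toy stand-in for the real 400'000-row data frame (assumption for illustration).
my_data_frame <- data.frame(id = 1:4000, x = rnorm(4000))

# Hypothetical per-chunk function: adds a column computed within the chunk.
process_chunk <- function(chunk_df) {
  chunk_df$x_centered <- chunk_df$x - mean(chunk_df$x)
  chunk_df
}

# Split into 4 chunks of 1000 rows each, process each chunk, then reassemble.
d <- split(my_data_frame, rep(1:4, each = 1000))
processed <- lapply(d, process_chunk)
result <- do.call(rbind, processed)

# rbind() on a named list can produce compound row names; reset them.
rownames(result) <- NULL
```

Because `split()` keeps the rows of each chunk in their original order and the chunks are bound back in order, `result` should have the same row order as the input.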

edited for brevity, after Hadley's comments.

If you don't know how many rows are in the data frame, or if the number of rows might not be an exact multiple of your desired chunk size, you can do

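chunk <- 1000                                      # rows per chunk
n <- nrow(my_data_frame)
r <- rep(1:ceiling(n/chunk), each = chunk)[1:n]    # chunk id for each row, truncated to n rows
d <- split(my_data_frame, r)                       # list of data frames; the last one may be shorter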
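
To see what the truncated grouping vector does when the row count is not a multiple of the chunk size, here is a small sketch on a made-up 2'500-row data frame (the data frame `toy` is an assumption for illustration):

```r
# Made-up data frame whose row count is not a multiple of the chunk size.
toy <- data.frame(id = 1:2500)

chunk <- 1000
n <- nrow(toy)
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]
d <- split(toy, r)

# Number of rows in each chunk; the last chunk holds the remainder.
chunk_sizes <- sapply(d, nrow)
print(chunk_sizes)

# An equivalent way to build the same chunk ids without rep().
r2 <- ceiling(seq_len(n) / chunk)
```

Here the first two chunks have 1000 rows each and the last one holds the remaining 500. The `ceiling(seq_len(n) / chunk)` form produces the same ids and may read more directly.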

You could also use

r <- ggplot2::cut_width(1:n,chunk,boundary=0)

For future readers, methods based on the dplyr and data.table packages will probably be (much) faster for doing group-wise operations on data frames, e.g. something like

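library(dplyr)
full_number <- 1000                                   # rows per group
ngrps <- ceiling(nrow(my_data_frame) / full_number)   # number of groups
my_data_frame %>%
   mutate(index = rep(1:ngrps, each = full_number)[seq_len(n())]) %>%
   group_by(index) %>%
   summarise(n_rows = n())   # placeholder step: use your own mutate(), summarise(), ...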
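
For the data.table side of that remark, a rough sketch of the same idea might look like the following. The toy table and the per-chunk summary are assumptions for illustration, and no timing claim is made here.

```r
library(data.table)

# Toy data (assumption for illustration).
dt <- data.table(id = 1:2500, x = as.numeric(1:2500))
chunk <- 1000

# Add a zero-based chunk index by reference, then compute one summary row per chunk.
dt[, index := (seq_len(.N) - 1) %/% chunk]
chunk_means <- dt[, .(mean_x = mean(x), n_rows = .N), by = index]
```

The grouped call replaces the explicit split/apply/reassemble loop: the work is done per chunk, and the result is already a single table.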

There are also many answers here

甜尕妞 2024-12-06 13:17:38

I had a similar question and used this:

library(tidyverse)
n <- 100  # rows per chunk (not the number of groups)
split <- df %>% group_by(chunk = (row_number() - 1) %/% n) %>% group_map(~ .x)

From left to right:

you assign your result to split
you start with df as your input data frame
then you group the rows by a chunk index, obtained by integer-dividing the zero-based row number by n (the number of rows per chunk); subtracting 1 keeps the first chunk from coming up one row short
then you pass each group through the group_map function, which returns a list.

So in the end split is a list in which each element is one chunk of your dataset.
Alternatively, you could write each chunk to disk straight away by replacing the group_map call with e.g. group_walk(~ write_csv(.x, paste0("file_", .y$chunk, ".csv"))).
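
As a quick check of this approach, the sketch below splits a made-up 250-row tibble into chunks of 100 rows and binds them back together. The toy data and the names `pieces`, `sizes`, and `reassembled` are assumptions for illustration; the index is zero-based (`row_number() - 1`), so every chunk except possibly the last has exactly n rows.

```r
library(dplyr)

# Made-up input (assumption for illustration).
df <- tibble(id = 1:250, x = id * 2)
n <- 100  # rows per chunk

# Group by a zero-based chunk index and collect each group into a list element.
pieces <- df %>%
  group_by(chunk = (row_number() - 1) %/% n) %>%
  group_map(~ .x)

# Rows per chunk, and the chunks stacked back into one tibble.
sizes <- vapply(pieces, nrow, integer(1))
reassembled <- bind_rows(pieces)
```

By default `group_map()` drops the grouping column from each piece, so `reassembled` has the same columns as `df`.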

You can find more info on these powerful tools on:
Cheat sheet of dplyr explaining group_by
and also below for:
group_map, group_walk follow up functions
