Performance of rbind.data.frame

Posted on 2024-11-06 14:05:20

I have a list of dataframes for which I am certain that they all contain at least one row (in fact, some contain only one row, and others contain a given number of rows), and that they all have the same columns (names and types). In case it matters, I am also certain that there are no NA's anywhere in the rows.

The situation can be simulated like this:

#create one row
onerowdfr<-do.call(data.frame, c(list(), rnorm(100) , lapply(sample(letters[1:2], 100, replace=TRUE), function(x){factor(x, levels=letters[1:2])})))
colnames(onerowdfr)<-c(paste("cnt", 1:100, sep=""), paste("cat", 1:100, sep=""))
#reuse it in a list
someParts<-lapply(rbinom(200, 1, 14/200)*6+1, function(reps){onerowdfr[rep(1, reps),]})

I've set the parameters (of the randomization) so that they approximate my true situation.

Now, I want to unite all these dataframes in one dataframe. I thought using rbind would do the trick, like this:

system.time(
result<-do.call(rbind, someParts)
)

Now, on my system (which is not particularly slow), and with the settings above, this is the output of system.time:

   user  system elapsed 
   5.61    0.00    5.62

Nearly 6 seconds for rbind-ing 254 (in my case) rows of 200 variables? Surely there has to be a way to improve the performance here? In my code, I have to do similar things very often (it is a form of multiple imputation), so I need this to be as fast as possible.

Comments (6)

半边脸i 2024-11-13 14:05:21

Can you build your matrices with numeric variables only and convert to a factor at the end? rbind is a lot faster on numeric matrices.

On my system, using data frames:

> system.time(result<-do.call(rbind, someParts))
   user  system elapsed 
  2.628   0.000   2.636 

Building the list with all numeric matrices instead:

onerowdfr2 <- matrix(as.numeric(onerowdfr), nrow=1)
someParts2<-lapply(rbinom(200, 1, 14/200)*6+1, 
                   function(reps){onerowdfr2[rep(1, reps),]})

results in a lot faster rbind.

> system.time(result2<-do.call(rbind, someParts2))
   user  system elapsed 
  0.001   0.000   0.001

EDIT: Here's another possibility; it just combines each column in turn.

> system.time({
+   n <- 1:ncol(someParts[[1]])
+   names(n) <- names(someParts[[1]])
+   result <- as.data.frame(lapply(n, function(i) 
+                           unlist(lapply(someParts, `[[`, i))))
+ })
   user  system elapsed 
  0.810   0.000   0.813  

Still not nearly as fast as using matrices though.

EDIT 2:

If you only have numerics and factors, it's not that hard to convert everything to numeric, rbind them, and convert the necessary columns back to factors. This assumes all factors have exactly the same levels. Converting to a factor from an integer is also faster than from a numeric so I force to integer first.

someParts2 <- lapply(someParts, function(x)
                     matrix(unlist(x), ncol=ncol(x)))
result<-as.data.frame(do.call(rbind, someParts2))
a <- someParts[[1]]
f <- which(sapply(a, class)=="factor")
for(i in f) {
  lev <- levels(a[[i]])
  result[[i]] <- factor(as.integer(result[[i]]), levels=seq_along(lev), labels=lev)
}

The timing on my system is:

   user  system elapsed 
   0.090    0.00    0.091 

女皇必胜 2024-11-13 14:05:21

Not a huge boost, but swapping rbind for rbind.fill from the plyr package knocks about 10% off the running time (with the sample dataset, on my machine).
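
A minimal sketch of that swap, assuming the plyr package is installed. The `parts` list is made-up stand-in data, not the dataset from the question; `rbind.fill` accepts a list of data frames as its first argument, so `do.call` is not needed.

```r
# Assumes plyr is installed: install.packages("plyr")
library(plyr)

# Hypothetical stand-in for someParts: 200 one-row data frames
# with one numeric column and one factor column.
parts <- lapply(1:200, function(i) {
  data.frame(cnt = rnorm(1), cat = factor("a", levels = c("a", "b")))
})

# rbind.fill() takes the list directly as its first argument.
combined <- rbind.fill(parts)
dim(combined)
```

rbind.fill is mainly designed for data frames whose columns differ (missing columns are filled with NA), so any speed gain on identically shaped inputs should be measured on your own data.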

梦中楼上月下 2024-11-13 14:05:21

If you really want to manipulate your data.frames faster, I would suggest using the data.table package and its rbindlist() function. I did not perform extensive tests, but for my dataset (3000 dataframes, 1000 rows x 40 columns each) rbindlist() takes only 20 seconds.
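
As a rough illustration of the rbindlist() call, assuming data.table is installed. The `parts` list below is invented for the example and is much smaller than the datasets discussed in this thread, so it says nothing about timings.

```r
# Assumes data.table is installed: install.packages("data.table")
library(data.table)

# Hypothetical stand-in for someParts: 200 one-row data frames
# with one numeric column and one factor column.
parts <- lapply(1:200, function(i) {
  data.frame(cnt = rnorm(1), cat = factor("a", levels = c("a", "b")))
})

# rbindlist() takes the list itself; no do.call() is involved.
combined <- rbindlist(parts)

# The result is a data.table; convert back if a plain data.frame is needed.
combined_df <- as.data.frame(combined)
dim(combined_df)
```

Wrapping the rbindlist() call in system.time() on the real list is the simplest way to see whether it helps in your case.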

浮萍、无处依 2024-11-13 14:05:21

This is ~25% faster, but there has to be a better way...

system.time({
  N <- do.call(sum, lapply(someParts, nrow))
  SP <- as.data.frame(lapply(someParts[[1]], function(x) rep(x,N)))
  k <- 0
  for(i in 1:length(someParts)) {
    j <- k+1
    k <- k + nrow(someParts[[i]])
    SP[j:k,] <- someParts[[i]]
  }
})

木落 2024-11-13 14:05:21

Make sure you're binding dataframe to dataframe. I ran into a huge performance degradation when binding a list to a dataframe.
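
One way to check for this before binding, sketched in base R. The `parts` list is a made-up example in which the second element is a plain list rather than a data frame.

```r
# Hypothetical input: the second element is a plain list, not a data frame.
parts <- list(
  data.frame(x = 1, y = "a"),
  list(x = 2, y = "b"),
  data.frame(x = 3, y = "c")
)

# Flag every element that is not a data frame.
is_df <- vapply(parts, is.data.frame, logical(1))
which(!is_df)

# Convert the offenders once, up front, so rbind only ever sees data frames.
parts[!is_df] <- lapply(parts[!is_df], as.data.frame)

combined <- do.call(rbind, parts)
nrow(combined)
```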

べ映画 2024-11-13 14:05:21

From the ecospace package, rbind_listdf works on chunks of 100 dataframes at a time. Compared to do.call(rbind), it seems to be more time and memory efficient when you are merging a list of several hundred dataframes. For merging 5000 dataframes of ~5GB total size, I saw peak memory use was ~25% less.
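
The chunking idea can be sketched in base R as follows. `rbind_in_chunks` is a hypothetical helper written for illustration; it is not the ecospace implementation and may differ from it in details.

```r
# Hypothetical helper (not ecospace's code): bind a list of data frames
# in chunks of chunk_size, then bind the per-chunk results.
rbind_in_chunks <- function(dfs, chunk_size = 100) {
  if (length(dfs) == 0) return(NULL)
  # Assign each list element to a chunk of at most chunk_size elements.
  chunk_id <- ceiling(seq_along(dfs) / chunk_size)
  chunks <- split(dfs, chunk_id)
  # Bind within each chunk, then bind the (far fewer) chunk results.
  partial <- lapply(chunks, function(chunk) do.call(rbind, chunk))
  result <- do.call(rbind, unname(partial))
  rownames(result) <- NULL
  result
}

# Made-up example data: 250 one-row data frames, so three chunks.
parts <- lapply(1:250, function(i) data.frame(id = i, value = i * 2))
combined <- rbind_in_chunks(parts)
dim(combined)
```

Each call to rbind then handles at most chunk_size inputs, which is the property the comment above credits for the lower peak memory use.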
