R: Convert each row of a data frame into a list item

Posted on 2024-10-19 15:37:05

I have a number of operations on data frames which I would like to speed up using mclapply() or other lapply()-like functions. One of the easiest ways for me to handle this is to make each row of the data frame a small data frame in a list. I can do this pretty easily with plyr like this:

df <- data.frame(a = rnorm(1e4), b = rnorm(1e4))
library(plyr)  # provides alply()
# Split along margin 1 (rows); each list element becomes a one-row data frame
system.time(myList <- alply(df, 1, function(x) data.frame(x)))

Once I have my data as a list I can easily do things like:

library(parallel)  # provides mclapply()
# doSomething() is a placeholder for the per-row function
mclapply(myList, function(x) doSomething(x$a))

This works swimmingly, but I have quite a lot of data and the alply() step is quite slow. I tried using the multicore parallel backend on the alply() step, but it never used more than one processor even though I had registered 8. I suspect the parallel option may not work with this type of problem.

Any tips on how to make this faster? Maybe a base R solution?

乱世争霸 2024-10-26 15:37:05

Just use split(). It is several times faster than your alply() line.

> system.time(myList <- alply( df, 1, function(x) data.frame(x) ))
   user  system elapsed 
   7.53    0.00    7.57 
> system.time( splitList <- split(df, 1:NROW(df)) )
   user  system elapsed 
   1.73    0.00    1.74 
>

I suspect the parallel backend for alply() is used only for evaluating the function, not for splitting the data and recombining the results.
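As a quick sanity check of the split() approach, here is a minimal sketch on a small data frame. The three-row `small` data frame is made up for illustration, and `seq_len(nrow(...))` is used in place of `1:NROW(...)`; for a non-empty data frame both give the same grouping.

```r
# Hypothetical toy data, only for illustration
small <- data.frame(a = c(1.5, 2.5, 3.5), b = c(10, 20, 30))

# One group per row: the grouping vector is simply the row index
rowList <- split(small, seq_len(nrow(small)))

length(rowList)      # one list element per row
class(rowList[[1]])  # each element is still a data.frame
rowList[["2"]]$a     # elements are named by row index; columns keep their names
```

Because every element is still a data frame, code written against the alply() result (for example `x$a` inside the applied function) should work unchanged on this list.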

UPDATE:
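If you can convert your data.frame to a matrix, the solution below will be extremely fast. Note that as.matrix() coerces all columns to a common type, so this suits data whose columns are all numeric. You can also call split() on the matrix, but it drops the names and returns a plain vector in each list element.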

> m <- as.matrix(df)
> system.time( matrixList <- lapply(1:NROW(m), function(i) m[i,,drop=FALSE]) )
   user  system elapsed 
   0.02    0.00    0.02
> str(matrixList[[1]])
 num [1, 1:2] -0.0956 -1.5887
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "a" "b"
> system.time( matrixSplitList <- split(m, 1:NROW(m)) )
   user  system elapsed 
   0.01    0.00    0.02 
> str(matrixSplitList[[1]])
 num [1:2] -0.0956 -1.5887
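The practical difference between the two matrix variants is how each row can be accessed afterwards. The sketch below illustrates this on a made-up three-row data frame (`small` is a hypothetical name, not part of the original answer).

```r
# Hypothetical toy data, only for illustration
small <- data.frame(a = c(1.5, 2.5, 3.5), b = c(10, 20, 30))
m <- as.matrix(small)

# Row-wise indexing with drop = FALSE keeps each row as a 1 x 2 matrix
matrixList <- lapply(seq_len(nrow(m)), function(i) m[i, , drop = FALSE])

# split() treats the matrix as a plain vector, so the dimnames are lost
matrixSplitList <- split(m, seq_len(nrow(m)))

colnames(matrixList[[2]])    # column names survive
names(matrixSplitList[[2]])  # NULL: the vector is unnamed
matrixList[[2]][, "a"]       # access by column name
matrixSplitList[[2]][1]      # access by position only
```

If the downstream function refers to columns by name, the lapply() variant is the safer choice; the split() variant only fits code that indexes by position.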

夏末的微笑 2024-10-26 15:37:05

How about this?

jdList <- split(df, 1:nrow(df))

> class(jdList[[1]])
[1] "data.frame"

> system.time(jdList <- split(df, 1:nrow(df)))
   user  system elapsed 
   1.67    0.02    1.70 
> system.time(myList <- alply( df, 1, function(x) data.frame(x) ))
   user  system elapsed 
    7.2     0.0     7.3
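To tie this back to the question, the list from split() can be passed straight to mclapply(). This is only a sketch: `doSomething()` below is a hypothetical stand-in for the asker's function, `small` is made-up data, and `mc.cores = 1L` is chosen so the example runs on any platform.

```r
library(parallel)  # mclapply() ships with R in the parallel package

# Hypothetical stand-in for the question's doSomething()
doSomething <- function(a) a^2

# Hypothetical toy data, only for illustration
small <- data.frame(a = c(1.5, 2.5, 3.5), b = c(10, 20, 30))
jdList <- split(small, seq_len(nrow(small)))

# mc.cores = 1L runs everywhere; raise it on Unix-alikes to fork workers
res <- mclapply(jdList, function(x) doSomething(x$a), mc.cores = 1L)
unlist(res)
```

On Unix-alikes, setting `mc.cores` to a larger value lets mclapply() fork worker processes; on Windows, values above 1 are not supported.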
