Using plyr, doMC, and summarise() with a very large dataset?

Posted on 2024-12-23 06:04:28

I have a fairly large dataset (~1.4m rows) that I'm doing some splitting and summarizing on. The whole thing takes a while to run, and my final application depends on frequent running, so my thought was to use doMC and the .parallel=TRUE flag with plyr like so (simplified a bit):

library(plyr)
require(doMC)
registerDoMC()

df <- ddply(df, c("cat1", "cat2"), summarize, count=length(cat2), .parallel = TRUE)

If I set the number of cores explicitly to two (using registerDoMC(cores=2)), my 8 GB of RAM sees me through, and it shaves off a decent amount of time. However, if I let it use all 8 cores, I quickly run out of memory, because each of the forked processes appears to clone the entire dataset in memory.
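
For reference, capping the worker count is a one-line change; a minimal sketch follows (the value of 2 is illustrative and should be chosen to fit your RAM):

library(doMC)
registerDoMC(cores = 2)   # fewer forked workers means fewer simultaneous copies of the data
getDoParWorkers()         # foreach reports how many workers are currently registered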

My question is whether it is possible to use plyr's parallel execution facilities in a more memory-thrifty way. I tried converting my data frame to a big.matrix, but this simply seemed to force the whole thing back to using a single core:

library(plyr)
library(doMC)
registerDoMC()
library(bigmemory)

bm <- as.big.matrix(df)
df <- mdply(bm, c("cat1", "cat2"), summarize, count=length(cat2), .parallel = TRUE)

This is my first foray into multicore R computing, so if there is a better way of thinking about this, I'm open to suggestion.

UPDATE: As with many things in life, it turns out I was doing Other Stupid Things elsewhere in my code, and the whole issue of multi-processing became a moot point in this particular instance. However, for big data folding tasks, I'll keep data.table in mind: I was able to replicate my folding task with it in a straightforward way.
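
For reference, a data.table version of the same grouped count might look like the sketch below (assuming the same df, cat1, and cat2 as above; .N is data.table's built-in per-group row count):

library(data.table)

dt <- as.data.table(df)                             # or setDT(df) to convert in place
counts <- dt[, .(count = .N), by = .(cat1, cat2)]   # one row per (cat1, cat2) group

data.table's grouping is implemented in C and avoids the per-group copying that plyr's split-apply-combine does, which is largely where the speed and memory savings come from.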

Comments (1)

如歌彻婉言 2024-12-30 06:04:28

I do not think that plyr makes copies of the entire dataset. However, when processing a chunk of data, that subset is copied to the worker. Therefore, when using more workers, more subsets are in memory simultaneously (i.e. 8 instead of 2).

I can think of a few tips you could try:

  • Put your data into an array structure instead of a data.frame and use adply to do the summarizing. Arrays are much more efficient in terms of memory use and speed. I mean ordinary matrices here, not big.matrix (see the sketch after this list).
  • Give data.table a try; in some cases this can lead to a speed increase of several orders of magnitude. I'm not sure whether data.table supports parallel processing, but even without parallelization, data.table can be hundreds of times faster. See a blog post of mine comparing ave, ddply and data.table for processing chunks of data.
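
As a rough sketch of the first tip applied to the counting task from the question (the helper function and column names are illustrative, and whether this actually beats ddply on real data would need measuring): a contingency table already is the array form, and adply() can flatten it back into a long data frame.

library(plyr)

tab <- table(df$cat1, df$cat2)   # counts by (cat1, cat2), stored as a plain array

# flatten the array back to one row per (cat1, cat2) cell
counts <- adply(tab, c(1, 2), function(cell) data.frame(count = as.integer(cell)))
names(counts)[1:2] <- c("cat1", "cat2")   # plyr typically labels unnamed split margins X1/X2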