Using plyr, doMC, and summarise() with very large datasets?
I have a fairly large dataset (~1.4m rows) that I'm doing some splitting and summarizing on. The whole thing takes a while to run, and my final application depends on frequent running, so my thought was to use doMC and the .parallel=TRUE flag with plyr, like so (simplified a bit):
library(plyr)
require(doMC)          # parallel backend used by plyr's .parallel option
registerDoMC()         # no cores argument, so the backend picks a default number of workers
df <- ddply(df, c("cat1", "cat2"), summarize, count = length(cat2), .parallel = TRUE)
If I set the number of cores explicitly to two (using registerDoMC(cores=2)), my 8 GB of RAM sees me through, and it shaves off a decent amount of time. However, if I let it use all 8 cores, I quickly run out of memory, because each of the forked processes appears to clone the entire dataset in memory.
My question is whether it is possible to use plyr's parallel execution facilities in a more memory-thrifty way. I tried converting my data frame to a big.matrix, but this simply seemed to force the whole thing back to using a single core:
library(plyr)
library(doMC)
registerDoMC()
library(bigmemory)
bm <- as.big.matrix(df)    # convert the data frame to a shared-memory big.matrix
df <- mdply(bm, c("cat1", "cat2"), summarize, count = length(cat2), .parallel = TRUE)
This is my first foray into multicore R computing, so if there is a better way of thinking about this, I'm open to suggestion.
UPDATE: As with many things in life, it turns out I was doing Other Stupid Things elsewhere in my code, and the whole issue of multi-processing is a moot point in this particular instance. However, for big data folding tasks, I'll keep data.table in mind. I was able to replicate my folding task with it in a straightforward way.
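For what it's worth, here is a minimal sketch of what that kind of fold might look like in data.table (assuming the same df with cat1 and cat2 columns as above; the actual data.table code used isn't shown in the question):
library(data.table)
dt <- as.data.table(df)                             # one-time conversion from the data frame
counts <- dt[, list(count = .N), by = c("cat1", "cat2")]   # .N is the row count per (cat1, cat2) group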
1 Answer
I do not think that plyr makes copies of the entire dataset. However, when processing a chunk of data, that subset is copied to the worker. Therefore, when using more workers, more subsets are in memory simultaneously (i.e. 8 instead of 2).
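As a rough illustration of that trade-off (not part of the original answer): registering fewer workers bounds how many chunk copies are held in memory at once, at the cost of less parallelism.
library(plyr)
library(doMC)
registerDoMC(cores = 4)   # e.g. four workers: roughly four chunks in flight at any moment
getDoParWorkers()         # foreach helper to confirm how many workers are registered
df_summary <- ddply(df, c("cat1", "cat2"), summarize, count = length(cat2), .parallel = TRUE)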
I can think of a few tips you could try:
Give data.table a try; in some cases this can lead to a speed increase of several orders of magnitude. I'm not sure if data.table supports parallel processing, but even without parallelization, data.table might be hundreds of times faster. See a blog post of mine comparing ave, ddply and data.table for processing chunks of data.
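To make that comparison concrete, here is a short sketch of the same per-group count written with ave and with data.table (the ddply version appears in the question above; the blog post's actual benchmark code is not reproduced here):
# base R: ave() adds a per-group count column rather than collapsing to one row per group
df$count <- ave(seq_len(nrow(df)), df$cat1, df$cat2, FUN = length)
# data.table: one row per (cat1, cat2) combination, typically much faster on ~1.4m rows
library(data.table)
dt <- as.data.table(df)
counts_dt <- dt[, list(count = .N), by = list(cat1, cat2)]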