Memory leak when using the MNP package in R
I have a question concerning memory use in R when using the MNP package. My goal is to estimate a multinomial probit model and then use the model to predict choices on a large set of data. I have split the predictor data into a list of pieces.
The problem is that when I loop over the list to predict, the memory used by R grows constantly, and R starts using swap space after reaching my computer's maximum memory. The allocated memory is not released even when those boundaries are hit. This happens even though I do not create any additional objects, so I don't understand what is going on.
Below I have pasted example code that suffers from the described problem. When running the example, memory use grows constantly, and the memory remains in use even after removing all variables and calling gc().
The real data I have is much larger than what is generated in the example, so I need to find a workaround.
My questions are:
Why does this script use so much memory?
How can I force R to release the allocated memory after each step?
library(MNP)
nr <- 10000
draws <- 500
pieces <- 100
# Create artificial training data
trainingData <- data.frame(y = sample(c(1, 2, 3), nr, replace = TRUE),
                           x1 = sample(1:nr), x2 = sample(1:nr), x3 = sample(1:nr))
# Create artificial predictor data
predictorData <- list()
for (i in 1:pieces) {
  predictorData[[i]] <- data.frame(y = NA, x1 = sample(1:nr),
                                   x2 = sample(1:nr), x3 = sample(1:nr))
}
# Estimate multinomial probit
mnp.out <- mnp(y ~ x1 + x2, data = trainingData, n.draws = draws)
# Predict using predictor data, piece by piece
predicted <- list()
for (i in 1:length(predictorData)) {
  cat('|')
  mnp.pred <- predict(mnp.out, predictorData[[i]], type = 'prob')$p
  mnp.pred <- colnames(mnp.pred)[apply(mnp.pred, 1, which.max)]
  predicted[[i]] <- mnp.pred
  rm(mnp.pred)
  gc()
}
# Combine the output into one factor
predicted <- factor(unlist(predicted))
Here are the output statistics after running the script:
> rm(list = ls())
> gc()
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  158950  8.5     407500  21.8   407500  21.8
Vcells  142001  1.1   33026373 252.0 61418067 468.6
Here is my R setup:
> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MNP_2.6-2 MASS_7.3-14
Comments (1)
The results don't seem anomalous; I don't think this is evidence of a memory leak. I suspect you are misreading the output of gc(): the right-hand columns report the maximum memory used while R has been tracking memory. If you call gc(reset = TRUE), the maximum shown will instead be the memory currently in use, i.e. the 8.5 MB and 1.1 MB listed under "used".
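For illustration, a minimal way to check this (the exact figures will differ on your machine):
gc(reset = TRUE)   # reset the "max used" statistics to the current usage
# ... run one iteration of the prediction loop ...
gc()               # "max used" now shows only the peak reached since the reset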
I suspect that MNP simply consumes a lot of memory during the prediction phase, so there is not much to be done other than breaking the prediction data into even smaller chunks with fewer rows.
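For example, a minimal sketch of re-splitting the existing pieces into smaller ones (the 1000-row chunk size is an arbitrary assumption; tune it to your memory budget):
# split each existing piece into sub-pieces of at most 1000 rows
smallPieces <- unlist(lapply(predictorData, function(d) {
  split(d, ceiling(seq_len(nrow(d)) / 1000))
}), recursive = FALSE)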
If you have multiple cores, you might consider using the foreach package along with doSMP or doMC, as this will give you both the speedup of independent calculations and the benefit of clearing the RAM allocated after each iteration of the loop completes (I believe it involves forking R into a separate memory space).
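A minimal sketch of that approach using doMC (assumptions on my part: doMC is installed, you are on a platform where forking works such as your OS X setup, and cores = 4 is an arbitrary choice):
library(foreach)
library(doMC)
registerDoMC(cores = 4)                  # register 4 forked worker processes
predicted <- foreach(piece = predictorData) %dopar% {
  p <- predict(mnp.out, piece, type = 'prob')$p
  colnames(p)[apply(p, 1, which.max)]    # most likely alternative for each row
}
predicted <- factor(unlist(predicted))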