Is there a way to prevent parallel::mclapply() from accessing the contents of the global environment?
Can the R function parallel::mclapply() be made to run in a RAM-efficient way in interactive RStudio sessions in situations where large objects reside in the global environment?
I find that when I use mclapply() to run analyses on multiple cores, the RAM consumed is always substantially (tens of GB, in my case) higher when running in an interactive RStudio session than when I run the exact same code via Rscript. My hunch is that this is because mclapply() duplicates the global environment on each core (I often have objects tens of gigabytes in size residing in the global environment), and supplying only the essential objects to the Rscript minimises this overhead.
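The mechanism behind that hunch can be illustrated with a minimal sketch. It shows only that a forked worker can read an object it was never handed. It does not show how much physical memory is actually duplicated. The names big_object, n_cores and visible_length are hypothetical, and the object is kept small so the example runs quickly.

```r
library(parallel)

# Hypothetical stand-in for a large object sitting in the workspace.
# It is never passed to the worker function as an argument.
big_object <- rnorm(1e6)

# Forking is unavailable on Windows, so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# The worker receives only `i`, yet it can still read `big_object`,
# because forked child processes inherit the parent's workspace.
visible_length <- mclapply(
  1:2,
  function(i) length(big_object),
  mc.cores = n_cores
)
```

Each element of visible_length is the length of big_object, which confirms that the workers see objects beyond the ones supplied to them.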
I am using Linux AWS EC2 machines with large amounts of RAM (e.g., 64 GB to 128 GB) and reasonably large numbers of CPU cores (e.g., 16–32), and I often find that running mclapply() on detectCores() - 1 cores interactively maxes out the RAM almost instantly (increasing by many tens of GB in seconds), whereas running the exact same code via Rscript uses barely any more RAM than was consumed before mclapply() was called. I have observed this behaviour for a wide range of unrelated analyses, which is why I am not including a reproducible example.
To run the mclapply() call via Rscript, I first save the necessary data objects to an .rda file, and then use system() to run a script via Rscript that loads the data objects, runs the mclapply() call, and then saves the output to a file that can be loaded back into the interactive session.
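The workaround described above might look roughly like the following sketch. All names here (needed_data, the temporary file paths, the squaring function) are hypothetical placeholders rather than my actual analysis, and writing the output as an .rds file is just one possible choice of output format.

```r
# Hypothetical stand-in for the objects the analysis actually needs.
needed_data <- list(x = 1:10)

input_file  <- tempfile(fileext = ".rda")
output_file <- tempfile(fileext = ".rds")
script_file <- tempfile(fileext = ".R")

# 1. Save only the necessary objects to an .rda file.
save(needed_data, file = input_file)

# 2. Write the script that the separate Rscript process will run.
writeLines(c(
  "args <- commandArgs(trailingOnly = TRUE)",
  "library(parallel)",
  "load(args[1])  # restores `needed_data`",
  "n_cores <- if (.Platform$OS.type == 'windows') 1L else 2L",
  "result <- mclapply(needed_data$x, function(v) v^2, mc.cores = n_cores)",
  "saveRDS(result, args[2])"
), script_file)

# 3. Run the script in a fresh R process that holds only `needed_data`.
rscript <- file.path(R.home("bin"), "Rscript")
status <- system(paste(
  shQuote(rscript),
  shQuote(script_file),
  shQuote(input_file),
  shQuote(output_file)
))

# 4. Load the output back into the interactive session.
result <- readRDS(output_file)
```

A non-zero value in status indicates that the child process failed, in which case output_file may not exist.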
Is this a widely known problem? If the problem is caused by mclapply() copying the global environment on each core, is there a way to ensure that it can only access the variables necessary for the analysis?