How can I get R to take advantage of a high-memory, multi-CPU server?

Posted 2024-09-06 07:20:44

So I've finally figured out how to get my R scripts to run on the Amazon EC2 cloud. I've been using an AMI with 26 ECUs, 8 cores, and 69 GB of RAM.

I then divide my code into multiple scripts and run each one in its own instance of R. With a server of this size, I can easily run 20-40 scripts simultaneously, each running several thousand simulations.

What I would like to know is whether R takes advantage of all this computing power natively. Should I install packages that specifically tell R to use all this extra memory and the multiple CPUs? I've seen this page, and some packages seem promising (at least from their descriptions), but I can't figure out how to incorporate them into my code.
Could anyone shed more light on this?

Comments (2)

辞慾 2024-09-13 07:20:44

You could look at the examples in my Intro to High-Performance Computing with R tutorials, a few versions of which are on this page.

The quickest way to use the multiple cores is the (excellent) multicore package; you should not have to do anything special to take advantage of the oodles of RAM you have there. multicore ties into foreach via doMC, but you can of course simply use the mclapply() function directly.
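
For illustration, here is a minimal sketch of both routes, assuming a hypothetical sim_once() function standing in for a single simulation run (on current R versions, mclapply() ships in the base parallel package, which absorbed multicore):

```r
library(parallel)  # mclapply() lives here on modern R; parallel absorbed multicore
library(foreach)
library(doMC)      # multicore-style parallel backend for foreach

registerDoMC(cores = 8)  # match the 8 cores on the instance

# Hypothetical stand-in for one simulation run
sim_once <- function(i) mean(rnorm(1e4))

# Route 1: call mclapply() directly, fanning 1000 runs across the cores
results_mc <- mclapply(1:1000, sim_once, mc.cores = 8)

# Route 2: the same work expressed through foreach + doMC
results_fe <- foreach(i = 1:1000, .combine = c) %dopar% sim_once(i)
```

Note that mclapply() forks the R process, so it works on the Linux instances you'd run on EC2 but not on Windows.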

离不开的别离 2024-09-13 07:20:44

Dirk's comments are spot on w.r.t. multicore/foreach/doMC.

If you are doing thousands of simulations, you may want to consider Amazon's Elastic MapReduce (EMR) service. When I wanted to scale my simulations in R, I started with huge EC2 instances and the multicore package (just like you!). It went well, but I ran up a hell of an EC2 bill. I didn't really need all that RAM, yet I was paying for it. And my jobs would finish at 3 AM, but I would not get into the office until 8 AM, so I paid for 5 hours I didn't need.

Then I discovered that I could use the EMR service to fire up 50 cheap small Hadoop instances, run my simulations, and then have them automatically shut down! I've totally abandoned running my sims on EC2 and now use EMR almost exclusively. This worked so well that my firm is beginning to test ways to migrate more of our periodic simulation activity to EMR.

Here's a blog post I wrote when I first started using multicore on EC2. Then, when I discovered I could do this with Amazon EMR, I wrote a follow-up post.

EDIT: Since writing this post, I've been working on a package to make it easier to use EMR with R for parallel apply functions. I've named the project Segue, and it's on Google Code.
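
For context, a Segue session looked roughly like the sketch below. Treat the function names and arguments as approximate, and note that the credential strings are placeholders (the package is deprecated per the update that follows):

```r
library(segue)

# Placeholder credentials -- substitute your own AWS keys
setCredentials("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")

# Spin up a small EMR/Hadoop cluster, run a parallel apply, then shut it down
myCluster <- createCluster(numInstances = 5)
results   <- emrlapply(myCluster, as.list(1:1000), function(i) mean(rnorm(1e4)))
stopCluster(myCluster)
```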

Further Update: I've since deprecated Segue because there are much better and more mature offerings for accessing Amazon's services from R.
