Parallel processing with AWS and R

Posted on 2024-12-02 05:17:50

I want to take a shot at the Kaggle Dunnhumby challenge by building a model for each customer. I want to split the data into ten groups and use Amazon web-services (AWS) to build models using R on the ten groups in parallel. Some relevant links I have come across are:

What I don't understand is:

  • How do I get the data into the ten nodes?
  • How do I send and execute the R functions on the nodes?

I would be very grateful if you could share suggestions and hints to point me in the right direction.

PS I am using the free usage account on AWS but it was very difficult to install R from source on the Amazon Linux AMIs (lots of errors due to missing headers, libraries and other dependencies).

Comments (2)

转身泪倾城 2024-12-09 05:17:50

You can build up everything manually at AWS. You have to build your own amazon computer cluster with several instances. There is a nice tutorial video available at the Amazon website: http://www.youtube.com/watch?v=YfCgK1bmCjw

But it will take you several hours to get everything running:

  • start 11 EC2 instances (one instance per group, plus one head instance)
  • install R and MPI on all machines (check for preinstalled images)
  • configure MPI correctly (and possibly add a security layer)
  • ideally, set up a file server mounted on all nodes (to share the data)
  • with this infrastructure, the best option is the snow or foreach package (with Rmpi)
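Once the cluster and MPI are in place, the snow workflow from the last bullet might look like the following sketch. Note that `groups` (a list of ten data frames) and `fit_model` are assumptions made for illustration, not part of the original answer:

```r
# Sketch: dispatch one model fit per data group over an MPI-backed snow cluster.
# Assumes Rmpi and snow are installed on every node and MPI is already configured.
library(snow)

cl <- makeCluster(10, type = "MPI")   # one worker per data group

fit_model <- function(df) {
  # placeholder: fit whatever per-customer model you choose
  lm(spend ~ visits, data = df)
}

clusterExport(cl, "fit_model")        # ship the function to all workers
models <- parLapply(cl, groups, fit_model)

stopCluster(cl)
```

The same fan-out could be written with `foreach` and a registered MPI backend; snow's `parLapply` is shown here because it maps most directly onto "apply one function to ten groups".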

The segue package is nice but you will definitely get data communication problems!

The simplest solution is cloudnumbers.com (http://www.cloudnumbers.com). This platform provides easy access to computer clusters in the cloud. You can test a small computer cluster in the cloud for free for 5 hours! Check the slides from the useR conference: http://cloudnumbers.com/hpc-news-from-the-user2011-conference

随心而道 2024-12-09 05:17:50

I'm not sure I can answer the question about which method to use, but I can explain how I would think about the question. I'm the author of Segue so keep that bias in mind :)

A few questions I would answer BEFORE I started trying to figure out how to get AWS (or any other system) running:

  1. How many customers are in the training data?
  2. How big is the training data (what you will send to AWS)?
  3. What's the expected average run time to fit a model to one customer... and for all runs combined?
  4. When you fit your model to one customer, how much data is generated (what you will return from AWS)?
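A quick way to answer the run-time question (3) is to time a single fit and extrapolate. Here is a minimal sketch; `fit_model`, `one_customer`, and the customer count are hypothetical stand-ins for your own code and data:

```r
# Sketch: time one model fit, then extrapolate to estimate the serial job size.
# `fit_model` and `one_customer` are placeholders for your own function/data.
elapsed <- system.time(fit_model(one_customer))["elapsed"]

n_customers <- 100000   # replace with the actual count from the training data
total_hours <- elapsed * n_customers / 3600
cat("Estimated serial run time:", round(total_hours, 1), "hours\n")
```

If the estimate comes out to minutes rather than days, the problem may not need a cluster at all.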

Just glancing at the training data, it doesn't look that big (~280 MB). So this isn't really a "big data" problem. If your models take a long time to create, it might be a "big CPU" problem, which Segue may or may not be a good tool to help you solve.

In answer to your specific question about how to get the data onto AWS: Segue does this by serializing the list object you provide to the emrlapply() command, uploading the serialized object to S3, then using the Elastic MapReduce service to stream the object through Hadoop. But as a user of Segue you don't need to know that. You just need to call emrlapply() and pass it your list data (probably a list where each element is a matrix or data frame of a single shopper's data) and a function (one you write to fit the model you choose), and Segue takes care of the rest. Keep in mind, though, that the very first thing Segue does when you call emrlapply() is to serialize your data (sometimes slowly) and upload it to S3. So depending on the size of the data and your internet connection's upload speed, this step can be slow.

I take issue with Markus' assertion that you will "definitely get data communication problems". That's clearly FUD. I use Segue on stochastic simulations that send/receive 300MB/1GB with some regularity. But I tend to run these simulations from an AWS instance, so I am sending and receiving from one AWS rack to another, which makes everything much faster.
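The Segue workflow described above can be sketched as follows. The split of the data into `shopper_list` (one element per customer) and the `fit_model` function are assumptions for illustration; the Segue calls themselves follow the package's documented usage:

```r
# Sketch of the Segue workflow: serialize a list, ship it to S3, and run the
# function over it via Elastic MapReduce. Assumes the segue package is
# installed and you have AWS credentials; `shopper_list` and `fit_model`
# are hypothetical.
library(segue)

setCredentials("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")

emr_cluster <- createCluster(numInstances = 10)

# emrlapply() works like lapply(), but each list element is processed
# on the remote Hadoop cluster.
models <- emrlapply(emr_cluster, shopper_list, fit_model)

stopCluster(emr_cluster)
```

The serialization-and-upload step mentioned above happens inside emrlapply(), which is why the first call can feel slow on a large list.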

If you want to do some analysis on AWS and get your feet wet with R in the cloud, I recommend Drew Conway's AMI for Scientific Computing. Using his AMI will save you from having to install or build much. To upload data to your running machine, once you have set up your ssh certificates, you can use scp to upload files to your instance.

I like running RStudio on my Amazon instances. This will require setting up password access to your instance. There are a lot of resources around for helping with this.
