How should I configure Amazon EC2 to perform parallelizable, data-intensive computations?
I have a computationally intensive project that is highly parallelizable: basically, I have a function that I need to run on each observation in a large PostgreSQL table. The function itself is a stored Python procedure.
Amazon EC2 seems like an excellent fit for the project.
My question is this: should I make a custom image (AMI) that already contains the database? This would seem to have the advantage of minimizing data transfers and making parallelization simple: each instance launched from the image could get an assigned block of indices to compute, e.g., instance 1 gets 1:100, instance 2 gets 101:200, etc. Keeping the data separate from the instances (which most how-to guides suggest) doesn't seem to make sense for my application, but I'm very new to this, so I'm not confident my intuition is right.
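To make the block assignment concrete, here is a minimal sketch of how the index ranges could be handed out. It assumes the table has a contiguous 1-based integer id, and the function name `assign_blocks` is hypothetical, not part of any library.

```python
def assign_blocks(n_rows, n_workers):
    """Split the 1-based ids 1..n_rows into contiguous, inclusive
    (start, end) blocks, one per worker.

    Assumption: ids are contiguous integers starting at 1.
    Workers that would receive no rows are simply left out.
    """
    if n_rows < 0 or n_workers < 1:
        raise ValueError("n_rows must be >= 0 and n_workers >= 1")

    # Every worker gets `base` rows; the first `extra` workers get one more.
    base, extra = divmod(n_rows, n_workers)

    blocks = []
    start = 1
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        if size == 0:
            break  # more workers than rows: nothing left to assign
        end = start + size - 1
        blocks.append((start, end))
        start = end + 1
    return blocks


if __name__ == "__main__":
    # Two workers over 200 rows reproduces the 1:100 / 101:200 split above.
    for worker, (start, end) in enumerate(assign_blocks(200, 2), start=1):
        print(f"instance {worker}: ids {start}..{end}")
```

Each instance would then only need to be told its own (start, end) pair, for example through a launch parameter, and restrict its query to that id range.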
Comments (2)
You will definitely want to keep the data and the server instance separate, so that changes to your data persist after you are done with the instance. Your best bet is to start with a basic image that has the OS and database platform you want to use, customize it to suit your needs, and then mount one or more EBS volumes containing your data. Once you have finished customizing, you may also want to create your own server image (AMI), unless what you are doing is fairly straightforward.
Some helpful links:
http://docs.amazonwebservices.com/AmazonEC2/gsg/2006-10-01/creating-an-image.html
http://developer.amazonwebservices.com/connect/entry.jspa?categoryID=100&externalID=1663
(You said Postgres, but this MySQL tutorial covers the same basic concepts you'll want to keep in mind.)
If you've already implemented the function in Python, the simplest route might be to look at PiCloud, which gives you a really easy interface for running a Python function on EC2 and handles pretty much everything else for you. Whether it makes economic sense will depend on how much data has to be sent per function call versus how long the computation takes to run.
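As a rough way to reason about that trade-off, here is a small back-of-the-envelope sketch. The function name and all numbers are hypothetical placeholders, not measurements or anything taken from PiCloud; you would plug in your own payload size, bandwidth, and timing.

```python
def transfer_to_compute_ratio(bytes_per_call, bandwidth_bytes_per_s,
                              compute_seconds_per_call):
    """Return (time spent sending data) / (time spent computing) per call.

    A ratio well below 1 suggests computation dominates, so shipping
    the data to a remote worker is cheap relative to the work done.
    A ratio near or above 1 suggests the transfer is the bottleneck.
    """
    if bandwidth_bytes_per_s <= 0 or compute_seconds_per_call <= 0:
        raise ValueError("bandwidth and compute time must be positive")
    transfer_seconds = bytes_per_call / bandwidth_bytes_per_s
    return transfer_seconds / compute_seconds_per_call


if __name__ == "__main__":
    # Hypothetical inputs: 1 MB per call, 10 MB/s link, 2 s of compute.
    ratio = transfer_to_compute_ratio(1_000_000, 10_000_000, 2.0)
    print(f"transfer/compute ratio: {ratio:.3f}")
```

This ignores per-call latency and any pricing details, so treat it only as a first sanity check.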