Background jobs on Amazon Web Services
I am new to AWS, so I need some advice on how to correctly create background jobs. I've got some data (about 30 GB) that I need to:
a) download from another server; it is a set of zip archives whose links are published in an RSS feed
b) decompress into S3
c) process each file (or sometimes a group of decompressed files), perform data transformations, and store the results in SimpleDB/S3
d) repeat forever, depending on RSS updates
Can someone suggest a basic architecture for a proper solution on AWS?
Thanks.
Denis
4 Answers
I think deploying your code on an Elastic Beanstalk instance will do the job for you at scale. Since you are processing a large chunk of data here, a single plain EC2 instance might max out its resources (mostly memory). The AWS SQS idea of batching the processing would also help optimize the pipeline and effectively manage timeouts on the server side.
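To make the SQS part concrete, here is a minimal batching-worker sketch, assuming boto3 (the AWS SDK for Python); the queue URL is a hypothetical placeholder and process() stands in for your real transformation step:

```python
# Minimal SQS batching-worker sketch (hypothetical queue URL and handler).
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/feed-jobs"  # hypothetical

def process(body):
    """Placeholder for the real download/decompress/transform step."""
    print("processing", body)

while True:
    # Fetch up to 10 messages per call; long polling avoids busy empty receives.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
        VisibilityTimeout=300,  # hide in-flight messages while the worker runs
    )
    for msg in resp.get("Messages", []):
        process(msg["Body"])
        # Delete only after success so failed jobs reappear and are retried.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```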
I think you should run an EC2 instance to perform all the tasks you need and shut it down when done. That way you pay only for the time the instance runs. Depending on your architecture you might need to keep it running all the time, but small instances are very cheap.
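If you go this route, the start/stop cycle can itself be scripted; a rough sketch with boto3, where the instance ID is a hypothetical placeholder and the actual work is assumed to run from the instance's own startup script:

```python
# Rough pay-per-run sketch: start a stopped worker instance, let its startup
# script do the batch, then stop it. The instance ID is hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical worker instance

ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])

# ... the instance's startup script performs steps a) through c) here ...

ec2.stop_instances(InstanceIds=[INSTANCE_ID])  # pay only for the time it ran
```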
You can use wget.
Try s3-tools (github.com/timkay/aws/raw/master/aws).
Write your own bash script.
Add one more bash script that checks for updates and runs the main script via cron (see the sketch below).
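For illustration, here is a rough Python equivalent of that bash + wget + s3-tools pipeline, meant to be run from cron; the feed URL and bucket name are hypothetical, and boto3 plus the standard library stand in for the shell tools:

```python
# Cron-driven fetch sketch: read the RSS feed, download each zip archive,
# and decompress it straight into S3. Feed URL and bucket are hypothetical.
import io
import urllib.request
import xml.etree.ElementTree as ET
import zipfile

import boto3

FEED_URL = "https://example.com/archives.rss"  # hypothetical RSS feed
BUCKET = "my-decompressed-data"                # hypothetical S3 bucket
s3 = boto3.client("s3")

# Collect archive links, assuming the usual RSS 2.0 <item><link> layout.
feed = urllib.request.urlopen(FEED_URL).read()
links = [item.findtext("link") for item in ET.fromstring(feed).iter("item")]

for url in filter(None, links):
    archive = urllib.request.urlopen(url).read()  # the wget step
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for name in zf.namelist():                # the "decompress into S3" step
            s3.put_object(Bucket=BUCKET, Key=name, Body=zf.read(name))
            # For 30 GB of data you would want streaming/multipart uploads
            # rather than buffering whole archives in memory.
```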
First off, write some code that does a) through c). Test it, etc.
If you want to run the code periodically, it's a good candidate for a background-process workflow: add a job to a queue, and remove it from the queue once it's deemed complete. Every hour or so, add a new job to the queue meaning "go fetch the RSS updates and decompress them".
You can do this by hand using Amazon Simple Queue Service (SQS) or any other background-job processing service or library. You'd set up a worker instance on EC2 (or any other hosting solution) that polls the queue, executes the task, and polls again, forever.
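The scheduling side of that loop is small. One possible shape of the hourly producer, assuming boto3 and the same hypothetical queue the worker polls (cron or a scheduled task would run this every hour):

```python
# Hourly producer sketch: enqueue a marker job that the worker interprets as
# "go fetch the RSS updates and decompress them". Queue URL is hypothetical.
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/feed-jobs"  # hypothetical

sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({"task": "fetch_rss_updates"}),
)
```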
It may be easier to use Amazon Simple Workflow Service, which seems to be intended for what you're trying to do (automated workflows). Note: I've never actually used it.