Fastest / best way to copy data between S3 and EC2?
I have a fairly large amount of data (~30G, split into ~100 files) I'd like to transfer between S3 and EC2: when I fire up the EC2 instances I'd like to copy the data from S3 to EC2 local disks as quickly as I can, and when I'm done processing I'd like to copy the results back to S3.
I'm looking for a tool that'll do a fast / parallel copy of the data back and forth. I have several scripts hacked up, including one that does a decent job, so I'm not looking for pointers to basic libraries; I'm looking for something fast and reliable.
5 Answers
I think you might be better off using an Elastic Block Store to store your files instead of S3. An EBS is akin to a 'drive' on S3 that can be mounted into your EC2 instance without having to copy the data each time, thereby allowing you to persist your data between EC2 instances without having to write to or read from S3 each time.
http://aws.amazon.com/ebs/
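For what it's worth, a rough sketch of what attaching and mounting a volume looks like with the AWS CLI; the volume ID, instance ID, device name, and mount point below are all illustrative:

# attach an existing EBS volume to a running instance (IDs are illustrative)
aws ec2 attach-volume --volume-id vol-0abc123 --instance-id i-0def456 --device /dev/sdf

# on the instance: create a filesystem the first time, then mount it
# (on Xen-based instances /dev/sdf typically appears as /dev/xvdf)
sudo mkfs -t ext4 /dev/xvdf
sudo mkdir -p /mnt/data
sudo mount /dev/xvdf /mnt/data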
Install the s3cmd package with your OS's package manager, then copy the data with it; s3cmd ls can also list the files. For more details see http://s3tools.org/s3cmd.
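A minimal sketch of those commands, assuming a Debian- or RedHat-style instance; the bucket and file names are illustrative:

# install s3cmd with your distribution's package manager
sudo apt-get install s3cmd        # Debian/Ubuntu
sudo yum install s3cmd            # RHEL/CentOS/Fedora

# copy a file down from S3, and list the bucket's contents
s3cmd get s3://mybucket/bigfile.dat /mnt/data/
s3cmd ls s3://mybucket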
For me the best form is: … run from PuTTY.
Unfortunately, Adam's suggestion won't work, as his understanding of EBS is wrong (although I wish he were right, and have often thought myself it should work that way)... EBS has nothing to do with S3; it only gives you an "external drive" for EC2 instances, separate from but attachable to an instance. You still have to copy between S3 and EC2, even though there are no data transfer costs between the two.
You didn't mention your instance's operating system, so I can't give tailored information. A popular command line tool I use is http://s3tools.org/s3cmd ... it is based on Python and therefore, according to the info on its website, it should work on Windows as well as Linux, although I use it ALL the time on Linux. You could easily whip up a quick script that uses its built-in "sync" command, which works similarly to rsync, and have it triggered every time you're done processing your data. You could also use the recursive put and get commands to move data only when needed.
There are also graphical tools like Cloudberry Pro with command line options for Windows that let you set up scheduled commands. http://s3tools.org/s3cmd is probably the easiest.
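As a sketch, a wrapper along those lines might look like this, assuming s3cmd has already been configured with your credentials (s3cmd --configure) and using illustrative bucket paths:

#!/bin/sh
# pull the input data down from S3 before processing
s3cmd sync s3://mybucket/input/ /mnt/data/input/

# ... run your processing job here ...

# push the results back to S3 once processing is done
s3cmd sync /mnt/data/output/ s3://mybucket/output/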
By now there is a sync command in the AWS command line tools that should do the trick: http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
On startup:
aws s3 sync s3://mybucket /mylocalfolder
Before shutdown:
aws s3 sync /mylocalfolder s3://mybucket
Of course, the details are always fun to work out, e.g. how parallel it is (and whether you can make it more parallel, and whether that's any faster given the virtual nature of the whole setup).
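On the parallelism question, the AWS CLI does expose transfer settings you can experiment with; a sketch, where the values shown are just starting points, not tuned recommendations:

# raise the number of concurrent S3 requests (the default is 10)
aws configure set default.s3.max_concurrent_requests 20
# optionally tune the multipart chunk size as well
aws configure set default.s3.multipart_chunksize 16MB
aws s3 sync s3://mybucket /mylocalfolder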
Btw hope you're still working on this... or somebody is. ;)