Moving 1 million image files to Amazon S3
I run an image sharing website that has over 1 million images (~150GB). I'm currently storing these on a hard drive in my dedicated server, but I'm quickly running out of space, so I'd like to move them to Amazon S3.
I've tried doing an RSYNC and it took RSYNC over a day just to scan and create the list of image files. After another day of transferring, it was only 7% complete and had slowed my server down to a crawl, so I had to cancel.
Is there a better way to do this, such as GZIP them to another local hard drive and then transfer / unzip that single file?
I'm also wondering whether it makes sense to store these files in multiple subdirectories or is it fine to have all million+ files in the same directory?
3 Answers
One option might be to perform the migration in a lazy fashion: serve each image through a small handler that checks whether the image already exists on S3; if it doesn't, upload it and then redirect the request to the S3 copy.
This should fairly quickly get all recent or commonly fetched images moved over to Amazon, which will reduce the load on your server. You can then add a background task that slowly migrates the remaining files whenever the server is least busy.
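A minimal sketch of the lazy approach, with the storage layer injected so the logic is self-contained. In production, the `s3_store` side would wrap real S3 calls (e.g. boto3's `head_object`/`upload_file`); all names here are illustrative, not part of any real API.

```python
# Lazy migration sketch: images are served through a handler that
# copies each file to S3 on first request. Plain dicts stand in for
# the local disk and the S3 bucket so the logic runs anywhere.

class LazyMigrator:
    def __init__(self, local_files, s3_store):
        self.local_files = local_files  # filename -> bytes (local disk)
        self.s3_store = s3_store        # filename -> bytes (S3 bucket)

    def fetch(self, filename):
        """Serve an image, migrating it to S3 on first access."""
        if filename not in self.s3_store:
            # Not on S3 yet: upload the local copy, then serve from S3.
            self.s3_store[filename] = self.local_files[filename]
        return self.s3_store[filename]

# Hot images migrate themselves as they are requested; a background
# job can loop over local_files during quiet hours to move the rest.
migrator = LazyMigrator({"cat.jpg": b"\xff\xd8"}, {})
migrator.fetch("cat.jpg")               # first request triggers the upload
print("cat.jpg" in migrator.s3_store)   # True
```

The key property is that migration cost is spread across normal traffic, so the popular fraction of the library moves first without a dedicated bulk transfer.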
Given that the files do not exist (yet) on S3, sending them as an archive file should be quicker than using a synchronization protocol.
However, compressing the archive won't help much (if at all) for image files, assuming that the image files are already stored in a compressed format such as JPEG.
Transmitting ~150 Gbytes of data is going to consume a lot of network bandwidth for a long time. This will be the same if you try to use HTTP or FTP instead of RSYNC to do the transfer. An offline transfer would be better if possible; e.g. sending a hard disc, or a set of tapes or DVDs.
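The archive route above can be sketched with `tar` and `split`: build an uncompressed archive (the JPEGs are already compressed, so gzip gains little) and cut it into fixed-size chunks that can be transferred or burned to discs independently. The demo below creates its own sample data; the paths and chunk size are illustrative.

```shell
# Work in a temp directory with one sample "image".
cd "$(mktemp -d)"
mkdir images && echo demo > images/photo.jpg

tar -cf images.tar images/       # -c create, -f file; no -z (no gzip)
split -b 512K images.tar part.   # fixed-size chunks: part.aa, part.ab, ...

# Receiving side: reassemble the chunks and unpack.
cat part.* > rebuilt.tar
tar -xf rebuilt.tar -C "$(mktemp -d)"
```

For a real 150 GB transfer you would use a chunk size like `-b 1G`, and each chunk can be resent individually if a transfer fails, unlike a single monolithic file.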
Putting a million files into one flat directory is a bad idea from a performance perspective. While some file systems cope with this fairly well, with O(log N) filename lookup times, others do not and have O(N) filename lookup. Multiply that by N to access all files in a directory. An additional problem is that utilities that need to access files in order of file name may slow down significantly if they need to sort a million file names. (This may partly explain why rsync took a day to do the indexing.) Putting all of your image files in one directory is also a bad idea from a management perspective; e.g. for doing backups, archiving stuff, moving stuff around, expanding to multiple discs or file systems, etc.
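A common way to avoid the flat directory is to shard file names into nested subdirectories by hash prefix. A sketch, assuming MD5 and two 2-hex-character levels (256 × 256 = 65,536 buckets, so a million files average ~15 per directory); the function name and parameters are made up for this example:

```python
# Shard file names into nested subdirectories using the leading
# characters of an MD5 hash, so no single directory holds millions
# of entries and the layout stays balanced regardless of file names.
import hashlib
import os

def shard_path(filename: str, levels: int = 2, width: int = 2) -> str:
    """Return a sharded relative path such as 'ab/cd/cat.jpg'."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(*parts, filename)
```

The same prefixes work as S3 key prefixes, so the local layout and the bucket layout can match, and each two-level bucket is a natural unit for backups or for splitting across discs.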
One option you could use instead of transferring the files over the network is to put them on a hard drive and ship it to Amazon's Import/Export service. That way you don't have to worry about saturating your server's network connection.