Is it a good idea to store hundreds of millions of small images in a key/value store or other NoSQL database?

Posted on 2024-10-01 18:32:56

I am developing a web system that handles a very large set of small images, about 100 million images of 50 KB ~ 200 KB each, stored on ReiserFS.

At the moment, it is very difficult to back up and sync such a large number of small files.

My question: is it a good idea to store these small images in a key/value store or other NoSQL database such as GridFS (MongoDB), Tokyo Tyrant, or Voldemort, to gain more performance and get better backup support?

Comments (3)

乖乖 2024-10-08 18:32:56

First off, have a look at this: Storing a million images in the filesystem. While it isn't about backups, it is a worthwhile discussion of the topic at hand.

And yes, large numbers of small files are pesky; they take up inodes, require space for filenames, etc. (and backing up all of this metadata takes time). Basically it sounds like you have the serving of the files figured out; if you run it on nginx, with Varnish or similar in front, you can hardly make it any faster. Adding a database under that will only make things more complicated, also when it comes to backing up. So I would suggest working harder on an in-place FS backup strategy.
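To get a rough sense of the metadata involved, inode usage and the raw file count on the image volume can be checked with standard tools (the path below is just a placeholder):

# inodes in use on the filesystem holding the images
df -i /var/www/images
# number of files that have to be walked on every backup
find /var/www/images -type f | wc -l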

To start with, have you tried rsync with the -az switches (archive and compression, respectively)? This tends to be highly effective, as rsync doesn't transfer the same files again and again.
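A minimal invocation along those lines, assuming the images live under /var/www/images and the backup host is reachable over SSH (both are placeholders):

# -a preserves permissions and timestamps, -z compresses data in transit;
# files that have not changed are skipped on subsequent runs
rsync -az /var/www/images/ backup.example.tld:/backup/images/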

Alternatively, my suggestion would be to tar + gzip the images into a number of archive files. Roughly like this (assuming you have them split into different sub-folders):

for prefix in *; do
    # archive one sub-folder, compress it, and stream it straight to the backup host
    tar -cf - "$prefix" | gzip -c -9 | ssh destination.example.tld "cat > backup_$(date -I)_$prefix.tar.gz"
done

This will create a number of .tar.gz files that can be transferred easily without too much overhead.
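Restoring is then just a matter of unpacking the archive for the prefix you need, for example (the file name is illustrative):

tar -xzf backup_2024-10-01_ab.tar.gz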

送君千里 2024-10-08 18:32:56

Another alternative is to store the images in SVN and actually have the image folder on the web server be an svn sandbox of the images. That simplifies backup, but will have zero net effect on performance.
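A rough sketch of that setup, with a placeholder repository URL and paths:

# import the existing images into the repository once (from a staging copy),
# then check out a working copy as the folder the web server serves
svn import /staging/images http://svn.example.tld/repos/images -m "initial import"
svn checkout http://svn.example.tld/repos/images /var/www/images

# new images are added and committed from the web server; an "svn update"
# of a working copy on another machine then doubles as the backup step
svn add /var/www/images/new_image.jpg
svn commit -m "add new image" /var/www/images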

Of course, make sure you configure your web server to not serve the .svn files.

乖乖兔^ω^ 2024-10-08 18:32:56

If all your images, or at least the most frequently accessed ones, fit into memory, then MongoDB GridFS might outperform the raw file system. You will have to experiment to find out.

Of course, depending on your file system, whether or not you break the images up into folders will affect performance. In the past I noticed that ReiserFS was better at storing large numbers of files in a single directory. However, I don't know whether that's still the best file system for the job.
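One low-effort way to run that experiment is the mongofiles utility that ships with MongoDB, which reads and writes GridFS directly (the database and file names below are placeholders):

# store an image in GridFS, then read it back out
mongofiles --db images put img_0001.jpg
mongofiles --db images get img_0001.jpg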
