Storing lots of small files: archive vs. file system
I am creating an application that requires a lot of image thumbnails (~3000, 5-25 KB each). Because speed is essential, I plan on loading these images into memory when the application starts. At runtime, new thumbnails will be downloaded and added to the collection.
I could store them all in a folder, but reading thousands of files into memory when a program starts hardly seems efficient.
My second option would be to save them in some kind of (compressed) archive. This would make storage itself and loading more efficient (I think). However, new files will be added regularly, and that will probably not go as smoothly as just saving them in a folder.
Is storing a cache of small files in a (compressed) archive a bad idea or not? Are ZIP files the way to go? Would I be better off using uncompressed archives (and if so, what kind)?
All image files will be JPEGs.
Thanks in advance!
EDIT: I am considering dropping the "load everything into memory on application start" idea. This would simplify my question a little. My initial idea of putting everything in one big file now seems less beneficial, since the problem of many files in one directory can be solved by hashing into subdirectories.
Comments (6)
Small files don't compress especially well, so you may not gain much compression.
Compressed files would load from disk faster because they are smaller, but decompression adds time. You'd have to experiment to see which is faster.
I would think the real issues would relate to the efficiency of the file system when it comes to iterating over all the little files, especially if they are all in one folder. Windows is notorious for being pretty inefficient when folders contain lots of files.
I would consider doing something like writing them out into one file, uncompressed, that could be streamed into memory -- maybe not necessarily contiguous memory, as that might be a problem. But the idea would be to put them all in one file. Then write some kind of index that ties a file name or other identifier to an offset from which the location of the image in memory could be determined.
New images could be added at the end, and the index updated appropriately.
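To make the idea concrete, here is a minimal sketch of such a single-file store. The class name `ThumbnailPack`, the two file names, and the JSON index format are all assumptions made for illustration; nothing in the approach requires them. It appends raw image bytes to one data file and keeps a name-to-(offset, length) index beside it.

```python
import json
import os


class ThumbnailPack:
    """Append-only data file plus an index of name -> [offset, length].

    Hypothetical helper for illustration, not an existing library.
    """

    def __init__(self, data_path, index_path):
        self.data_path = data_path
        self.index_path = index_path
        self.index = {}
        # Reload the index from a previous run, if there is one.
        if os.path.exists(index_path):
            with open(index_path, "r", encoding="utf-8") as f:
                self.index = json.load(f)

    def add(self, name, data):
        # New images always go at the end of the data file.
        with open(self.data_path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(data)
        self.index[name] = [offset, len(data)]
        self._save_index()

    def _save_index(self):
        # Write to a temporary file first, then swap it in, so a crash
        # mid-write does not leave a half-written index behind.
        tmp_path = self.index_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(self.index, f)
        os.replace(tmp_path, self.index_path)

    def get(self, name):
        # Seek straight to the image; no other file is touched.
        offset, length = self.index[name]
        with open(self.data_path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```

Rewriting the whole index on every add is fine for a few thousand entries; for much larger collections you would probably want to append index records instead. Deleting or replacing images is deliberately left out, since that is where a hand-rolled format starts to get complicated.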
It isn't fancy, but fancy is what you're trying to avoid. An archive or even a file system gives you lots of power and flexibility, but at the cost of efficiency. When you know what you want to do, sometimes simple is better.
I would consider implementing a solution that reads files from a folder, another that divides the files into subfolders and subsubfolders so there are no more than 100 or so files in any given folder, then time those solutions so you have something to compare to. I would think a simple indexed file would be fast enough that you wouldn't even need to pre-load the images like you're suggesting -- just retrieve them as you need them and keep them around once they're in memory.
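For the subfolder variant, one common way to spread files evenly is to derive the folder names from a hash of the file name. The sketch below assumes that scheme; the function names, the choice of SHA-1, and the two-hex-character folder names are illustrative choices, not something prescribed above. With one level of 256 folders, about 3000 files come out at roughly a dozen per folder.

```python
import hashlib
from pathlib import Path


def sharded_path(root, name, levels=1, width=2):
    """Map a file name to root/<hash prefix>/.../name.

    Each level uses `width` hex characters of the SHA-1 of the name,
    so width=2 gives 256 folders per level.
    """
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return Path(root).joinpath(*parts, name)


def store(root, name, data):
    """Write the bytes under the sharded path, creating folders as needed."""
    path = sharded_path(root, name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path


def load(root, name):
    """Read the bytes back; the path is recomputed from the name alone."""
    return sharded_path(root, name).read_bytes()
```

Because the path is a pure function of the name, no directory listing is ever needed to find a file. Passing `levels=2` gives the folder-within-folder layout for much larger collections.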
All disk-based storage, and most databases, allocate space in chunks. The chunks on large-capacity disks can be large. If you have 5 KB files and a 32 KB disk chunk, you end up with roughly 85% wasted space on your storage.
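The arithmetic behind that figure is easy to check. The sketch below assumes the simplest model, where every file occupies a whole number of chunks and nothing else is counted; the function names are made up for the example.

```python
def allocated_size(file_size, block_size):
    """Bytes actually reserved on disk: file size rounded up to whole blocks."""
    blocks = -(-file_size // block_size)  # ceiling division on integers
    return blocks * block_size


def wasted_fraction(file_sizes, block_size):
    """Share of the reserved space that holds no file data."""
    used = sum(file_sizes)
    allocated = sum(allocated_size(size, block_size) for size in file_sizes)
    if allocated == 0:
        return 0.0
    return (allocated - used) / allocated


KB = 1024
print(f"{wasted_fraction([5 * KB], 32 * KB):.1%}")  # prints 84.4%
print(f"{wasted_fraction([5 * KB], 4 * KB):.1%}")   # prints 37.5%
```

So the 85% applies to the smallest thumbnails on a 32 KB chunk size; with 4 KB chunks or with the larger 25 KB files the loss is considerably smaller.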
Using an archive won't compress the JPEGs much, because the JPEG encoding algorithm already does that. It will, however, save you the wasted space on the storage media. It does make things more complicated and perhaps a little slower.
In my opinion, the zip file approach is a bad idea, because you will slow everything down with the process of loading the zip file and unzipping it to extract each image.
I think the whole point of a thumbnail image is that it is small by nature, so your app and hardware can load it as fast as possible. So I believe it is a better idea to load each image as you need it.
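Loading on demand can still avoid repeated disk reads if each image is kept once it has been read. A minimal sketch, assuming the thumbnails sit as plain files in one directory; `make_thumbnail_loader` and the cache size are made-up names and values for the example.

```python
from functools import lru_cache
from pathlib import Path


def make_thumbnail_loader(directory, maxsize=4096):
    """Return a function that reads a thumbnail on first use and caches it.

    maxsize caps how many images stay in memory; the least recently
    used ones are dropped first.
    """
    directory = Path(directory)

    @lru_cache(maxsize=maxsize)
    def load(name):
        # Only reached on a cache miss; hits are served from memory.
        return (directory / name).read_bytes()

    return load
```

With about 3000 images of 5-25 KB, a cache large enough to hold all of them stays well under 100 MB, so in practice nothing would be evicted.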
Well, if you have small, "geometric" pictures, you may implement them as objects of type javax.swing.Icon rather than images to load from the filesystem.
http://download.oracle.com/javase/6/docs/api/javax/swing/Icon.html
http://download.oracle.com/javase/tutorial/uiswing/components/icon.html
So you will implement one or more objects which draw themselves onto a Graphics surface using the Graphics drawing primitives, instead of copying pixels.
If this is a web application, then the best performance boost you can get is setting good HTTP caching headers. Having a unique URL for every image (and different URLs for different versions of the same image) makes it possible to set very far-future expiry headers, because changing the image changes the URL, which leads to a refetch.
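One way to get such unique URLs is to put a hash of the image content into the file name. The sketch below assumes that scheme; the `/thumbs` base path, the function names, and the one-year lifetime are illustrative values, not part of any particular framework.

```python
import hashlib

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def versioned_url(name, data, base="/thumbs"):
    """Build a URL that changes whenever the image bytes change."""
    digest = hashlib.sha256(data).hexdigest()[:16]
    stem, dot, ext = name.rpartition(".")
    if not dot:
        # No extension: just append the hash.
        return f"{base}/{name}.{digest}"
    return f"{base}/{stem}.{digest}.{ext}"


def far_future_headers():
    """Response headers telling clients to keep the image for a year."""
    return {"Cache-Control": f"public, max-age={ONE_YEAR_SECONDS}, immutable"}
```

Since a changed image gets a new URL, clients never need to revalidate the old one, which is what makes the long lifetime safe.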
I wouldn't compress, because JPEG doesn't compress well and it only costs CPU time.
I would recommend simply storing the images in the filesystem and considering libraries like jawr, or implementing your own caching strategy.
I know this question has already been answered, but I think you need more options than just zipping.
While zip is good, it doesn't do much for JPEGs, since JPEG is already compressed.
Another thing you may want to consider:
Since you mention JPEG, you may want to use jpegtran. Run jpegtran on all your JPEGs.
This tool performs lossless JPEG operations such as rotation, and can also be used to optimize your images and strip comments and other unneeded information (such as EXIF data) from them.
jpegtran -copy none -optimize -perfect src.jpg dest.jpg
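To run this over a whole folder, a small wrapper script is enough. The sketch below is one way to do it, assuming `jpegtran` is on the PATH; the function names and the `*.jpg` pattern are made up for the example. It passes the output name through `-outfile`, because some jpegtran builds accept only a single input file on the command line.

```python
import subprocess
from pathlib import Path


def jpegtran_command(src, dest):
    """Argument list for one lossless optimization pass (same switches as above)."""
    return [
        "jpegtran", "-copy", "none", "-optimize", "-perfect",
        "-outfile", str(dest), str(src),
    ]


def optimize_folder(src_dir, dest_dir):
    """Run jpegtran on every .jpg in src_dir, writing results to dest_dir."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).glob("*.jpg")):
        dest = dest_dir / src.name
        # check=True raises CalledProcessError if jpegtran reports a failure.
        subprocess.run(jpegtran_command(src, dest), check=True)
        written.append(dest)
    return written
```

Writing to a separate destination folder keeps the originals untouched until you have checked the results.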
For details, read: http://developer.yahoo.com/performance/rules.html#opt_images
For a basic examination of how to improve your website's performance, you can try installing YSlow (a plugin that detects inefficient code) in Firefox.
Hope that helps.