How to store millions of pictures of around 2 KB each
We're creating an ASP.Net MVC site that will need to store 1 million+ pictures, all around 2-5 KB in size. From previous research, it looks like a file server is probably better than a DB (feel free to comment otherwise).
Is there anything special to consider when storing this many files? Are there any issues with Windows being able to find the photo quickly if there are so many files in one folder? Does a segmented directory structure need to be created, for example dividing them up by filename? It would be nice if the solution would scale to at least 10 million pictures for potential future expansion needs.
5 Answers
4 KB is the default cluster size for NTFS. You might tune this setting depending on the typical picture size.
http://support.microsoft.com/kb/314878
I would build a tree with subdirectories, both to be able to move from one FS to another (see: How many files can I put in a directory?)
and to avoid some issues: http://www.frank4dd.com/howto/various/maxfiles-per-dir.htm
You can also keep associated pictures together in archives so they can be loaded with only one file open, as sketched below. Those archives might be compressed if the bottleneck is I/O, or uncompressed if it's CPU.
A DB is easier to maintain but slower... so it's up to you!
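A minimal sketch of that archive idea, assuming the built-in System.IO.Compression types (the class name, method name and the ioBound flag are illustrative, not from the answer):

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

static class PictureArchiver
{
    // Pack a group of related pictures into a single archive so a later
    // read only has to open one file. Per the answer above: compress when
    // the bottleneck is I/O, store uncompressed when the bottleneck is CPU.
    public static void PackGroup(string archivePath, IEnumerable<string> picturePaths, bool ioBound)
    {
        var level = ioBound ? CompressionLevel.Optimal : CompressionLevel.NoCompression;

        using (var zip = ZipFile.Open(archivePath, ZipArchiveMode.Create))
        {
            foreach (var path in picturePaths)
                zip.CreateEntryFromFile(path, Path.GetFileName(path), level);
        }
    }
}
```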
See also this Server Fault question for some discussion about directory structures.
The problem is not that the filesystem can't store that many files in one directory, but that if you ever want to browse that directory with Windows Explorer it will take forever. So if you will ever need to access that folder manually, you should segment it, for example with a directory for the first 2-3 letters/numbers of each name, or an even deeper structure.
Dividing them into 1k folders with 1k files each will be more than enough, and the code to do that is quite simple.
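A rough illustration of that layout, assuming the pictures are keyed by a sequential numeric id (the root path and file extension are placeholders):

```csharp
using System.IO;

static class PictureStore
{
    // Bucket picture N into folder N / 1000: roughly 1,000 files per folder
    // for the first million pictures, and 10,000 folders at 10 million.
    public static string PathFor(string root, long pictureId)
    {
        string folder = (pictureId / 1000).ToString("D4");   // e.g. "0042"
        string dir = Path.Combine(root, folder);
        Directory.CreateDirectory(dir);                       // no-op if it already exists
        return Path.Combine(dir, pictureId + ".jpg");
    }
}
```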
Assuming NTFS, there is a limit of 4 billion files per volume (2^32 - 1). That's the total limit for all the folders on the volume (including operating system files etc.)
Large numbers of files in a single folder should not be a problem; NTFS uses a B+ tree for fast retrieval. Microsoft recommends that you disable short-file name generation (the feature that allows you to retrieve mypictureofyou.html as mypic~1.htm).
I don't know if there's any performance advantage to segmenting them into multiple directories; my guess is that there would not be an advantage, because NTFS was designed for performance with large directories.
If you do decide to segment them into multiple directories, use a hash function on the file name to get the directory name (rather than the directory name being the first letter of the file name for instance) so that each subdirectory has roughly the same number of files.
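A sketch of that hash-based segmentation, assuming an MD5 of the file name and a two-level directory depth (both arbitrary choices for illustration, not from the answer):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class HashedPictureStore
{
    // Hash the file name and use the first hex digits as subdirectories,
    // so files spread evenly across buckets regardless of naming patterns.
    public static string PathFor(string root, string fileName)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(fileName));
            string hex = BitConverter.ToString(hash, 0, 2).Replace("-", "");  // e.g. "A3F1"
            // e.g. root\A3\F1\mypictureofyou.jpg
            return Path.Combine(root, hex.Substring(0, 2), hex.Substring(2, 2), fileName);
        }
    }
}
```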
I wouldn't rule out using a content delivery network. They are designed for this problem. I've had a lot of success with Amazon S3. Since you are using a Microsoft based solution, perhaps Azure might be a good fit.
Is there some sort of requirement that prevents you from using a third-party solution?
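If the Azure route is taken, a hedged sketch using the Azure.Storage.Blobs client might look like this (the connection string, container name and blob naming are placeholders, not part of the answer):

```csharp
using System.IO;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

static class BlobPictureStore
{
    // Upload one picture to a blob container; the blob name can reuse the
    // same kind of prefix used for directories to keep listings manageable.
    public static async Task UploadAsync(string connectionString, string localPath, string blobName)
    {
        var container = new BlobContainerClient(connectionString, "pictures");
        await container.CreateIfNotExistsAsync();

        using (FileStream stream = File.OpenRead(localPath))
        {
            await container.UploadBlobAsync(blobName, stream);
        }
    }
}
```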