Lots of small files or a couple of big ones?
In terms of performance and efficiency, is it better to use lots of small files (by lots I mean as many as a few million) or a couple (ten or so) huge (several-gigabyte) files? Let's just say I'm building a database (not entirely true, but all that matters is that it's going to be accessed a LOT).
I'm mainly concerned with read performance. My filesystem is currently ext3 on Linux (Ubuntu Server Edition if it matters), although I'm in a position where I can still switch, so comparisons between different filesystems would be fabulous. For technical reasons I can't use an actual DBMS for this (hence the question), so "just use MySQL" is not a good answer.
Thanks in advance, and let me know if I need to be more specific.
EDIT: I'm going to be storing lots of relatively small pieces of data, which is why using lots of small files would be easier for me. So if I went with using a few large files, I'd only be retrieving a few KB out of them at a time. I'd also be using an index, so that's not really a problem. Also, some of the data points to other pieces of data (it would point to the file in the lots-of-small-files case, and point to the data's location within the file in the large-files case).
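To make the large-files variant concrete, here is a minimal sketch of what a lookup would look like, assuming an index that maps each piece of data to an (offset, length) pair inside one big file; all names here are illustrative, not my actual code:

```python
# Minimal sketch of the large-files variant: an index maps each record ID
# to an (offset, length) pair inside one big data file. Names like
# "data.bin" and "rec-42" are made up for illustration.

def read_record(data_path, index, record_id):
    offset, length = index[record_id]
    with open(data_path, "rb") as f:
        f.seek(offset)          # jump straight to the record
        return f.read(length)   # read only the few KB needed

index = {"rec-42": (1024, 4096)}  # record starts at byte 1024, is 4 KB long
# payload = read_record("data.bin", index, "rec-42")
```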
5 Answers
There are a lot of assumptions here but, for all intents and purposes, searching through a large file will be much quicker than searching through a bunch of small files.
Let's say you are looking for a string of text contained in a text file. Searching a 1 TB file will be much faster than opening 1,000,000 one-megabyte files and searching through those.
Each file-open operation takes time. A large file only has to be opened once.
And, in considering disk performance, a single file is much more likely to be stored contiguously than a large series of files.
...Again, these are generalizations without knowing more about your specific application.
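As a rough way to see the per-file overhead mentioned above, a throwaway micro-benchmark along these lines could be used; it is only indicative, since the page cache favors recently written data, so a fair test would drop caches between runs:

```python
import os
import tempfile
import time

# Illustrative micro-benchmark: read N small files vs. one big file of the
# same total size. Results are only indicative; drop caches for a fair test.
N, CHUNK = 1000, 4096
tmp = tempfile.mkdtemp()
big_path = os.path.join(tmp, "big.dat")

# Create N small files and one big file with identical contents.
with open(big_path, "wb") as big:
    for i in range(N):
        data = os.urandom(CHUNK)
        with open(os.path.join(tmp, f"small_{i}.dat"), "wb") as f:
            f.write(data)
        big.write(data)

start = time.perf_counter()
for i in range(N):
    with open(os.path.join(tmp, f"small_{i}.dat"), "rb") as f:
        f.read()
print("N small files:", time.perf_counter() - start)

start = time.perf_counter()
with open(big_path, "rb") as f:
    while f.read(CHUNK):
        pass
print("one big file: ", time.perf_counter() - start)
```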
It depends. Really. Different filesystems are optimized in different ways, but in general, small files are packed efficiently. The advantage of having large files is that you don't have to open and close a lot of things; open and close are operations that take time. If you have a large file, you normally open and close it only once and use seek operations in between.
If you go for the lots-of-files solution, I suggest a nested directory structure like the one sketched below, because filesystems put limits on the number of files in a directory.
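A minimal sketch of such a layout, assuming the logical name is hashed and the first hex characters pick two levels of subdirectories (the hash choice and prefix lengths are arbitrary here):

```python
import hashlib
import os

# Shard files across two directory levels so no single directory ends up
# with millions of entries. md5 and 2-char prefixes are arbitrary choices.
def shard_path(root, name):
    h = hashlib.md5(name.encode()).hexdigest()
    return os.path.join(root, h[:2], h[2:4], name)

path = shard_path("/data", "record-123456")   # e.g. /data/3e/a1/record-123456
os.makedirs(os.path.dirname(path), exist_ok=True)
```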
The main issue here IMO is indexing. If you're going to search for information in a huge file without a good index, you'll have to scan the whole file for the correct information, which can take a long time. If you think you can build strong indexing mechanisms, then fine, you should go with the huge file (see the sketch at the end of this answer).
I'd prefer to delegate this task to ext3, which should be rather good at it.
EDIT:
A thing to consider, according to the Wikipedia article on ext3, is that fragmentation does happen over time. So if you have a huge number of small files that take up a significant percentage of the file system, you will lose performance over time.
The article also validates the claim about the 32k files-per-directory limit (assuming a Wikipedia article can validate anything).
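For what it's worth, the index in question can be as simple as recording each record's byte offset while appending to the big file; a hedged sketch, with all names illustrative:

```python
# Append variable-length records to one big file, remembering each
# record's (offset, length) so reads can later seek directly to it.
def append_record(data_path, index, record_id, payload):
    with open(data_path, "ab") as f:
        offset = f.tell()        # append mode positions us at end of file
        f.write(payload)
    index[record_id] = (offset, len(payload))

index = {}
# append_record("data.bin", index, "rec-1", b"hello world")
```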
I believe Ext3 has a limit of about 32000 files/subdirectories per directory. If you're going the millions of files route, you'll need to spread them throughout many directories. I don't know what that would do to performance.
My preference would be for the several large files. In fact, why have several at all, unless they're some kind of logically-separate units? If you're still splitting it up just for the sake of splitting it, I say don't do that. Ext3 can handle very large files just fine.
I work with a system that stores up to about 5 million files on an XFS file system under Linux and haven't had any performance problems. We only use the files for storing the data; we never scan them in full. We have a database for searching, and one of the fields in a table contains a GUID that we use for retrieval. We use exactly two levels of directories, as described above, with the filenames being the GUID, though more levels could be used if the number of files grew even larger. We chose this approach to avoid storing a few extra terabytes in the database that only needed to be stored and returned, never searched through, and it has worked well for us. Our files range from 1 KB to about 500 KB.
We have also run the system on ext3, and it functioned fine, though I'm not sure we ever pushed it past about a million files. We'd probably need to go to a three-level directory system due to the maximum-files-per-directory limitations.
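For reference, the path scheme described above can be as simple as slicing the GUID itself; the directory names below are illustrative, not our exact layout:

```python
import os

# Two directory levels derived from the GUID; the database stores only the
# GUID, and the path is reconstructed at retrieval time.
def guid_to_path(root, guid):
    return os.path.join(root, guid[:2], guid[2:4], guid)

print(guid_to_path("/data", "3f9ac2d47b1e4a0f"))
# -> /data/3f/9a/3f9ac2d47b1e4a0f
```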