2 files with half the content vs. 1 file with twice the content: which is bigger?
If I have 2 files each with this:
"Hello World" (x 1000)
Does that take up more space than 1 file with this:
"Hello World" (x 2000)
What are the drawbacks of dividing content into multiple smaller files (assuming there's reason to divide them into more files, not like this example)?
Update:
I'm using a MacBook Pro running Mac OS X 10.5. But I'd also like to know for Ubuntu Linux.
Marcelo's answer gives the general performance case. I'd argue that worrying about this is premature optimization; you should split things into different files where it is logical to split them.

Also, if you really care about the file size of such repetitive files, you can compress them. Your example even hints at this: a simple run-length encoding of "Hello World" x 1000 is much more space-efficient than actually writing "Hello World" out 1000 times.
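For instance, a minimal Python sketch (using zlib as a stand-in for any general-purpose compressor) of how well that kind of repetitive content compresses:

    import zlib

    # 1000 repetitions of "Hello World" -- 11,000 bytes uncompressed
    data = b"Hello World" * 1000
    compressed = zlib.compress(data)

    print(len(data))        # 11000
    print(len(compressed))  # a few dozen bytes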
Files take up space on the disk in the form of clusters. A cluster is a group of sectors, and its size depends on how the disk was formatted.

A typical cluster size is 8 kilobytes. That would mean that the two smaller files would use two clusters (16 kilobytes) each, while the larger file would use three clusters (24 kilobytes).

On average, a file uses half a cluster more than its size requires. So with a cluster size of 8 kilobytes, each file carries an average overhead of 4 kilobytes.
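A quick Python sketch of that arithmetic, assuming the 8-kilobyte cluster size from the example above:

    import math

    CLUSTER = 8 * 1024  # assumed cluster size: 8 kB

    def allocated(size_bytes):
        # Space consumed on disk: logical size rounded up to whole clusters.
        return math.ceil(size_bytes / CLUSTER) * CLUSTER

    small = len(b"Hello World" * 1000)  # ~11 kB -> 2 clusters each
    large = len(b"Hello World" * 2000)  # ~22 kB -> 3 clusters

    print(2 * allocated(small))  # 32768 bytes for the two small files
    print(allocated(large))      # 24576 bytes for the one large file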
Most filesystems use a fixed-size cluster (4 kB is typical but not universal) for storing files. Files below this cluster size will all take up the same minimum amount.
Even above this size, the proportional wastage tends to be high when you have lots of small files. Ignoring skewness of size distribution (which makes things worse), the overall wastage is about half the cluster size times the number of files, so the fewer files you have for a given amount of data, the more efficiently you will store things.
Another consideration is that metadata operations, especially file deletion, can be very expensive, so again smaller files aren't your friends. Some interesting work was done in ReiserFS on this front until the author was jailed for murdering his wife (I don't know the current state of that project).
If you have the option, you can also tune the file sizes to always fill up a whole number of clusters, and then small files won't be a problem. This is usually too finicky to be worth it though, and there are other costs. For high-volume throughput, the optimal file size these days is between 64 MB and 256 MB (I think).
Practical advice: Stick your stuff in a database unless there are good reasons not to. SQLite substantially reduces the number of reasons.
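As a rough illustration, here is a minimal sqlite3 sketch (the database and table names are just examples) that stores many small pieces of content as rows in a single database file instead of as separate files:

    import sqlite3

    # One database file holds all the content; no per-file cluster overhead.
    conn = sqlite3.connect("content.db")  # example file name
    conn.execute("CREATE TABLE IF NOT EXISTS docs (name TEXT PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT OR REPLACE INTO docs VALUES (?, ?)",
        [("doc%d" % i, "Hello World" * 1000) for i in range(2)],
    )
    conn.commit()
    print(conn.execute("SELECT name, length(body) FROM docs").fetchall())
    conn.close()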
I think how the file(s) will be used should also be taken into consideration, along with the API and the language used to read/write them (and hence any API restrictions).

Disk fragmentation, which tends to decrease when you have only big files, will penalize data access if you're reading one big file in one shot, whereas several accesses to small files spaced out over time will not be penalized by fragmentation.
Most filesystems allocate space in units larger than a byte (typically 4KB nowadays). Effective file sizes get "rounded up" to the next multiple of that "cluster size". Therefore, dividing up a file will almost always consume more total space. And of course there's one extra entry in the directory, which may cause it to consume more space, and many file systems have an extra intermediate layer of inodes where each file consumes one entry.
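On Linux or Mac OS X you can observe this rounding directly: st_blocks in os.stat reports allocated space in 512-byte units, so it is usually larger than the logical st_size. A small sketch (the file name is just an example):

    import os

    # Write a file far smaller than one cluster, then compare its logical
    # size with the space actually allocated for it on disk.
    path = "tiny.txt"  # example file name
    with open(path, "wb") as f:
        f.write(b"Hello World")

    st = os.stat(path)
    print("logical size:   %d bytes" % st.st_size)  # 11
    print("allocated size: %d bytes" % (st.st_blocks * 512))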