Data structure and algorithm for representing/allocating free space in a file
I have a file with "holes" in it and want to fill them with data; I also need to be able to free "used" space and make free space.
I was thinking of using a bi-map that maps offsets and lengths. However, I am not sure that is the best approach if there are really tiny gaps in the file. A bitmap would work, but I don't know how to switch to one dynamically for certain regions of space. Perhaps some sort of radix tree is the way to go?
For what it's worth, I am up to speed on modern file system design (ZFS, HFS+, NTFS, XFS, ext...) and I find their solutions woefully inadequate.
My goal is pretty good space savings (hence the concern about small fragments). If I didn't care about that, I would just go for two splay trees... one sorted by offset and the other sorted by length, with ties broken by offset. Note that this gives you amortized log(n) for all operations, with a working-set time of log(m)... pretty darn good... but, as previously mentioned, it does not handle high fragmentation.
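For concreteness, here is a minimal sketch of that two-index idea, using std::map/std::set (red-black trees, so worst-case rather than amortized log n) standing in for the splay trees, with best-fit allocation and coalescing on free. All names are illustrative:

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <optional>
#include <set>
#include <utility>

// Free space tracked in two indexes: by offset (for coalescing on free)
// and by (length, offset) (for best-fit allocation, ties broken by offset).
class FreeSpace {
public:
    // Best fit: smallest free extent that can hold `len` bytes.
    std::optional<uint64_t> allocate(uint64_t len) {
        auto it = by_length_.lower_bound({len, 0});
        if (it == by_length_.end()) return std::nullopt;  // nothing fits
        auto [elen, eoff] = *it;
        erase(eoff, elen);
        if (elen > len) insert(eoff + len, elen - len);   // keep the tail free
        return eoff;
    }

    // Free [off, off+len), merging with adjacent free extents. Assumes the
    // caller frees exactly what it allocated (no overlapping ranges).
    void release(uint64_t off, uint64_t len) {
        auto next = by_offset_.lower_bound(off);
        if (next != by_offset_.end() && next->first == off + len) {
            uint64_t nlen = next->second;                 // merge with successor
            erase(next->first, nlen);
            len += nlen;
        }
        auto it = by_offset_.lower_bound(off);
        if (it != by_offset_.begin()) {
            auto prev = std::prev(it);
            if (prev->first + prev->second == off) {      // merge with predecessor
                uint64_t poff = prev->first, plen = prev->second;
                erase(poff, plen);
                off = poff;
                len += plen;
            }
        }
        insert(off, len);
    }

private:
    void insert(uint64_t off, uint64_t len) {
        by_offset_.emplace(off, len);
        by_length_.emplace(len, off);
    }
    void erase(uint64_t off, uint64_t len) {
        by_offset_.erase(off);
        by_length_.erase({len, off});
    }
    std::map<uint64_t, uint64_t> by_offset_;             // offset -> length
    std::set<std::pair<uint64_t, uint64_t>> by_length_;  // (length, offset)
};
```

Swapping the std::map/std::set for splay trees would recover the amortized working-set bounds; the two-index structure and the coalescing logic stay the same.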
I have shipped commercial software that does just that. In the latest iteration, we ended up sorting blocks of the file into "type" and "index," so you could read or write "the third block of type foo." The file ended up being structured as follows (a struct sketch follows the list):
1) File header. Points at master type list.
2) Data. Each block has a header with type, index, logical size, and padded size.
3) Arrays of (offset, size) tuples for each given type.
4) Array of (type, offset, count) that keeps track of the types.
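As a sketch, those four records might look like the packed structs below. The field names, widths, and the magic value are my assumptions; the answer only specifies what each record must contain:

```cpp
#include <cstdint>

// Packed so the structs mirror the on-disk layout byte for byte.
#pragma pack(push, 1)
struct FileHeader {        // 1) file header
    uint32_t magic;                // format identifier
    uint64_t master_list_off;      // points at the master type list
    uint32_t master_list_count;    // number of TypeEntry records
};
struct BlockHeader {       // 2) precedes every data block
    uint32_t type;
    uint32_t index;                // "the third block of type foo" -> index 2
    uint64_t logical_size;         // bytes actually written by the caller
    uint64_t padded_size;          // bytes reserved on disk (>= logical_size)
};
struct ExtentEntry {       // 3) per-type index: one entry per block
    uint64_t offset;
    uint64_t size;
};
struct TypeEntry {         // 4) master type list: one entry per type
    uint32_t type;
    uint64_t offset;               // where that type's ExtentEntry array lives
    uint64_t count;                // how many blocks of this type exist
};
#pragma pack(pop)
```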
We defined it so that each block was an atomic unit. You started writing a new block, and finished writing that before starting anything else. You could also "set" the contents of a block. Starting a new block always appended at the end of the file, so you could append as much as you wanted without fragmenting the block. "Setting" a block could re-use an empty block.
When you opened the file, we loaded all the indices into RAM. When you flushed or closed a file, we re-wrote each index that changed, at the end of the file, then re-wrote the index index at the end of the file, then updated the header at the front. This means that changes to the file were all atomic -- either you commit to the point where the header is updated, or you don't. (Some systems use two copies of the header 8 kB apart to preserve headers even if a disk sector goes bad; we didn't take it that far.)
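A sketch of that commit sequence using POSIX I/O. The fsync placement and the bare-offset header are my assumptions; the answer only fixes the write order (changed indices, then the index index, then the header at the front):

```cpp
#include <cstdint>
#include <string>
#include <vector>
#include <unistd.h>

// Error handling is reduced to returning false; a real header would be a
// struct rather than a bare offset, and partial writes would be retried.
bool commit(int fd, const std::vector<std::string>& dirty_indices,
            const std::string& index_index /* serialized master list */) {
    off_t end = lseek(fd, 0, SEEK_END);
    for (const std::string& idx : dirty_indices) {  // 1) changed indices at EOF
        if (pwrite(fd, idx.data(), idx.size(), end) < 0) return false;
        end += (off_t)idx.size();
    }
    uint64_t master_off = (uint64_t)end;            // 2) index index at EOF
    if (pwrite(fd, index_index.data(), index_index.size(), end) < 0) return false;
    if (fsync(fd) != 0) return false;  // indices durable before the header flips
    if (pwrite(fd, &master_off, sizeof master_off, 0) < 0) return false;  // 3) header
    return fsync(fd) == 0;             // commit point: header points at new state
}
// A crash before step 3 completes leaves the old header intact, so the
// file reopens in its previous consistent state.
```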
One of the block "types" was "free block." When re-writing changed indices, and when replacing the contents of a block, the old space on disk was merged into the free list kept in the array of free blocks. Adjacent free blocks were merged into a single bigger block. Free blocks were re-used when you "set content" or for updated type block indices, but not for the index index, which always was written last.
Because the indices were always kept in memory, working with an open file was really fast -- typically just a single read to get the data of a single block (or get a handle to a block for streaming). Opening and closing was a little more complex, as it needed to load and flush the indices. If it becomes a problem, we could load the secondary type index on demand rather than up-front to amortize that cost, but it never was a problem for us.
Top priority for persistent (on disk) storage: Robustness! Do not lose data even if the computer loses power while you're working with the file!
Second priority for on-disk storage: Do not do more I/O than necessary! Seeks are expensive. On Flash drives, each individual I/O is expensive, and writes are doubly so. Try to align and batch I/O. Using something like malloc() for on-disk storage is generally not great, because it does too many seeks. This is also a reason I don't like memory mapped files much -- people tend to treat them like RAM, and then the I/O pattern becomes very expensive.
For memory management I am a fan of the BiBOP* approach, which is normally efficient at managing fragmentation.
The idea is to segregate data based on their size. This way, within a "bag" you only have "pages" of small blocks with identical sizes:
The bag keeps a simple free-list of the available pages. Each page keeps a free-list of available storage units in an overlay over those units.
You need an index to map size to its corresponding bag.
You also need special treatment for "out-of-norm" requests (i.e. requests for allocations larger than the page size).
This storage is extremely space efficient, especially for small objects, because the overhead is not per-object. There is one drawback, however: you can end up with "almost empty" pages that still contain one or two occupied storage units.
This can be alleviated if you have the ability to "move" existing objects, which effectively allows pages to be merged.
(*) BiBOP: Big Bag Of Pages
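A minimal in-memory sketch of the scheme. The 4 kB page size, the size-class rounding, and the malloc-backed pages are my choices; the ideas from the answer are one cell size per page, a free list overlaid on the free cells, and an index from size to bag:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <map>
#include <vector>

constexpr std::size_t kPageSize = 4096;   // assumed page size

// One page holds cells of a single size; the free list is overlaid on the
// free cells themselves (each free cell stores a pointer to the next).
struct Page {
    std::byte* base;
    void* free_head;

    explicit Page(std::size_t cell_size)
        : base(static_cast<std::byte*>(std::malloc(kPageSize))), free_head(base) {
        std::size_t n = kPageSize / cell_size;
        for (std::size_t i = 0; i < n; ++i) {
            void** cell = reinterpret_cast<void**>(base + i * cell_size);
            *cell = (i + 1 < n) ? base + (i + 1) * cell_size : nullptr;
        }
    }
};

// A bag: all pages of one cell size, plus a simple list of those pages.
struct Bag {
    std::size_t cell_size;
    std::vector<Page> pages;

    void* allocate() {
        for (Page& p : pages)
            if (p.free_head) {
                void* cell = p.free_head;
                p.free_head = *static_cast<void**>(cell);  // pop overlay list
                return cell;
            }
        pages.emplace_back(cell_size);    // every page full: add a fresh one
        return allocate();
    }
};

// Index mapping a (rounded) size to its bag; oversized requests bypass it.
class Bibop {
public:
    void* allocate(std::size_t size) {
        if (size > kPageSize) return std::malloc(size);   // "out-of-norm" path
        // Round up so each cell can hold the overlay pointer, aligned.
        std::size_t cs = (std::max(size, sizeof(void*)) + sizeof(void*) - 1)
                         & ~(sizeof(void*) - 1);
        return bags_.try_emplace(cs, Bag{cs}).first->second.allocate();
    }
    // Deallocation (omitted) would push the cell back onto its page's
    // overlay free list; the per-object overhead stays zero either way.
private:
    std::map<std::size_t, Bag> bags_;
};
```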
I would recommend making a customized file system (which might of course contain just one file), based on FUSE. There are a lot of FUSE solutions available to build on - I recommend choosing an unrelated but simple project, so it is easy to learn from.
Which algorithm and data structure to choose depends heavily on your needs. It could be: a map, a list, or a file split into chunks with on-the-fly compression/decompression.
The data structures you propose are good ideas. As you clearly see, there is a trade-off: fragmentation vs. compaction.
On one side - best compaction, highest fragmentation - splay trees and many other kinds of trees.
On the other side - lowest fragmentation, worst compaction - a linked list.
In between there are B-trees and others.
As I understand it, your stated priority is space saving - while taking care of performance.
I would recommend a mixed data structure in order to meet all the requirements.
This gives you a fast response on demand, while "optimising" things as they are used (for example, for each read of 10 MB of data, defragment 1 MB) or in idle moments.
The simplest solution is a free list: keep a linked list of free blocks, reusing the free space itself to store the address of the next block in the list.
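A sketch of that embedded free list, assuming fixed-size blocks and an 8-byte header at offset 0 holding the offset of the first free block (0 can mean "none" because the header itself occupies offset 0); these details are my assumptions:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr std::size_t kBlockSize = 4096;  // assumed fixed block size

// Error handling is omitted for brevity.
uint64_t read_u64(std::FILE* f, uint64_t off) {
    uint64_t v = 0;
    std::fseek(f, (long)off, SEEK_SET);
    std::fread(&v, sizeof v, 1, f);
    return v;
}
void write_u64(std::FILE* f, uint64_t off, uint64_t v) {
    std::fseek(f, (long)off, SEEK_SET);
    std::fwrite(&v, sizeof v, 1, f);
}

// Pop a block off the free list, or grow the file if the list is empty.
uint64_t alloc_block(std::FILE* f) {
    uint64_t head = read_u64(f, 0);
    if (head == 0) {
        std::fseek(f, 0, SEEK_END);
        uint64_t off = (uint64_t)std::ftell(f);
        char zeros[kBlockSize] = {};
        std::fwrite(zeros, 1, kBlockSize, f);   // append one fresh block
        return off;
    }
    write_u64(f, 0, read_u64(f, head));         // head = head->next
    return head;
}

// Push a block onto the free list, reusing its first 8 bytes as the link.
void free_block(std::FILE* f, uint64_t off) {
    write_u64(f, off, read_u64(f, 0));          // block->next = old head
    write_u64(f, 0, off);                       // head = block
}
```

Open the file with std::fopen(path, "rb+") and write the 8-byte zero header once at creation; after that the list lives entirely inside the file. The drawback, as the question anticipates, is that fixed-size blocks trade internal fragmentation for this simplicity.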