Data structure and algorithm for representing/allocating free space in a file
I have a file with "holes" in it and want to fill them with data; I also need to be able to free "used" space and make free space.
I was thinking of using a bi-map that maps offsets and lengths. However, I am not sure that is the best approach if there are really tiny gaps in the file. A bitmap would work, but I don't know how to switch to one dynamically for certain regions of space. Perhaps some sort of radix tree is the way to go?
For what it's worth, I am up to speed on modern file system design (ZFS, HFS+, NTFS, XFS, ext...) and I find their solutions woefully inadequate.
My goal is pretty good space savings (hence the concern about small fragments). If I didn't care about that, I would just go for two splay trees... one sorted by offset and the other sorted by length, with ties broken by offset. Note that this gives you amortized log(n) for all operations, with a working-set time of log(m)... pretty darn good... but, as previously mentioned, it does not handle high fragmentation.
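For concreteness, here is a minimal sketch of that two-index idea, using std::map/std::set (red-black trees, so worst-case rather than amortized log n) standing in for the splay trees, with best-fit allocation and coalescing on free. All names are illustrative:

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <optional>
#include <set>
#include <utility>

// Free space tracked in two indexes: by offset (for coalescing on free)
// and by (length, offset) (for best-fit allocation, ties broken by offset).
class FreeSpace {
public:
    // Best fit: smallest free extent that can hold `len` bytes.
    std::optional<uint64_t> allocate(uint64_t len) {
        auto it = by_length_.lower_bound({len, 0});
        if (it == by_length_.end()) return std::nullopt;  // nothing fits
        auto [elen, eoff] = *it;
        erase(eoff, elen);
        if (elen > len) insert(eoff + len, elen - len);   // keep the tail free
        return eoff;
    }

    // Free [off, off+len), merging with adjacent free extents. Assumes the
    // caller frees exactly what it allocated (no overlapping ranges).
    void release(uint64_t off, uint64_t len) {
        auto next = by_offset_.lower_bound(off);
        if (next != by_offset_.end() && next->first == off + len) {
            uint64_t nlen = next->second;                 // merge with successor
            erase(next->first, nlen);
            len += nlen;
        }
        auto it = by_offset_.lower_bound(off);
        if (it != by_offset_.begin()) {
            auto prev = std::prev(it);
            if (prev->first + prev->second == off) {      // merge with predecessor
                uint64_t poff = prev->first, plen = prev->second;
                erase(poff, plen);
                off = poff;
                len += plen;
            }
        }
        insert(off, len);
    }

private:
    void insert(uint64_t off, uint64_t len) {
        by_offset_.emplace(off, len);
        by_length_.emplace(len, off);
    }
    void erase(uint64_t off, uint64_t len) {
        by_offset_.erase(off);
        by_length_.erase({len, off});
    }
    std::map<uint64_t, uint64_t> by_offset_;             // offset -> length
    std::set<std::pair<uint64_t, uint64_t>> by_length_;  // (length, offset)
};
```

Swapping the std::map/std::set for splay trees would recover the amortized working-set bounds; the two-index structure and the coalescing logic stay the same.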
I have shipped commercial software that does just that. In the latest iteration, we ended up sorting blocks of the file into "type" and "index," so you could read or write "the third block of type foo." The file ended up being structured as follows (a struct sketch follows the list):
1) File header. Points at master type list.
2) Data. Each block has a header with type, index, logical size, and padded size.
3) Arrays of (offset, size) tuples for each given type.
4) Array of (type, offset, count) that keeps track of the types.
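As a sketch, those four records might look like the packed structs below. The field names, widths, and the magic value are my assumptions; the answer only specifies what each record must contain:

```cpp
#include <cstdint>

// Packed so the structs mirror the on-disk layout byte for byte.
#pragma pack(push, 1)
struct FileHeader {        // 1) file header
    uint32_t magic;                // format identifier
    uint64_t master_list_off;      // points at the master type list
    uint32_t master_list_count;    // number of TypeEntry records
};
struct BlockHeader {       // 2) precedes every data block
    uint32_t type;
    uint32_t index;                // "the third block of type foo" -> index 2
    uint64_t logical_size;         // bytes actually written by the caller
    uint64_t padded_size;          // bytes reserved on disk (>= logical_size)
};
struct ExtentEntry {       // 3) per-type index: one entry per block
    uint64_t offset;
    uint64_t size;
};
struct TypeEntry {         // 4) master type list: one entry per type
    uint32_t type;
    uint64_t offset;               // where that type's ExtentEntry array lives
    uint64_t count;                // how many blocks of this type exist
};
#pragma pack(pop)
```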
We defined it so that each block was an atomic unit. You started writing a new block, and finished writing that before starting anything else. You could also "set" the contents of a block. Starting a new block always appended at the end of the file, so you could append as much as you wanted without fragmenting the block. "Setting" a block could re-use an empty block.
When you opened the file, we loaded all the indices into RAM. When you flushed or closed a file, we re-wrote each index that changed, at the end of the file, then re-wrote the index index at the end of the file, then updated the header at the front. This means that changes to the file were all atomic -- either you commit to the point where the header is updated, or you don't. (Some systems use two copies of the header 8 kB apart to preserve headers even if a disk sector goes bad; we didn't take it that far.)
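A sketch of that commit sequence using POSIX I/O. The fsync placement and the bare-offset header are my assumptions; the answer only fixes the write order (changed indices, then the index index, then the header at the front):

```cpp
#include <cstdint>
#include <string>
#include <vector>
#include <unistd.h>

// Error handling is reduced to returning false; a real header would be a
// struct rather than a bare offset, and partial writes would be retried.
bool commit(int fd, const std::vector<std::string>& dirty_indices,
            const std::string& index_index /* serialized master list */) {
    off_t end = lseek(fd, 0, SEEK_END);
    for (const std::string& idx : dirty_indices) {  // 1) changed indices at EOF
        if (pwrite(fd, idx.data(), idx.size(), end) < 0) return false;
        end += (off_t)idx.size();
    }
    uint64_t master_off = (uint64_t)end;            // 2) index index at EOF
    if (pwrite(fd, index_index.data(), index_index.size(), end) < 0) return false;
    if (fsync(fd) != 0) return false;  // indices durable before the header flips
    if (pwrite(fd, &master_off, sizeof master_off, 0) < 0) return false;  // 3) header
    return fsync(fd) == 0;             // commit point: header points at new state
}
// A crash before step 3 completes leaves the old header intact, so the
// file reopens in its previous consistent state.
```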
One of the block "types" was "free block." When re-writing changed indices, and when replacing the contents of a block, the old space on disk was merged into the free list kept in the array of free blocks. Adjacent free blocks were merged into a single bigger block. Free blocks were re-used when you "set content" or for updated type block indices, but not for the index index, which always was written last.
Because the indices were always kept in memory, working with an open file was really fast -- typically just a single read to get the data of a single block (or get a handle to a block for streaming). Opening and closing was a little more complex, as it needed to load and flush the indices. If it becomes a problem, we could load the secondary type index on demand rather than up-front to amortize that cost, but it never was a problem for us.
Top priority for persistent (on disk) storage: Robustness! Do not lose data even if the computer loses power while you're working with the file!
Second priority for on-disk storage: Do not do more I/O than necessary! Seeks are expensive. On Flash drives, each individual I/O is expensive, and writes are doubly so. Try to align and batch I/O. Using something like malloc() for on-disk storage is generally not great, because it does too many seeks. This is also a reason I don't like memory mapped files much -- people tend to treat them like RAM, and then the I/O pattern becomes very expensive.
For memory management I am a fan of the BiBOP* approach, which is normally efficient at managing fragmentation.
The idea is to segregate data based on their size. This way, within a "bag" you only have "pages" of small blocks with identical sizes:
The bag keeps a simple free-list of the available pages. Each page keeps a free-list of available storage units in an overlay over those units.
You need an index to map size to its corresponding bag.
You also need special treatment for "out-of-norm" requests (i.e. requests for allocations larger than the page size).
This storage is extremely space efficient, especially for small objects, because the overhead is not per-object. There is one drawback, however: you can end up with "almost empty" pages that still contain one or two occupied storage units.
This can be alleviated if you have the ability to "move" existing objects, which effectively allows pages to be merged.
(*) BiBOP: Big Bag Of Pages
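A minimal in-memory sketch of the scheme. The 4 kB page size, the size-class rounding, and the malloc-backed pages are my choices; the ideas from the answer are one cell size per page, a free list overlaid on the free cells, and an index from size to bag:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <map>
#include <vector>

constexpr std::size_t kPageSize = 4096;   // assumed page size

// One page holds cells of a single size; the free list is overlaid on the
// free cells themselves (each free cell stores a pointer to the next).
struct Page {
    std::byte* base;
    void* free_head;

    explicit Page(std::size_t cell_size)
        : base(static_cast<std::byte*>(std::malloc(kPageSize))), free_head(base) {
        std::size_t n = kPageSize / cell_size;
        for (std::size_t i = 0; i < n; ++i) {
            void** cell = reinterpret_cast<void**>(base + i * cell_size);
            *cell = (i + 1 < n) ? base + (i + 1) * cell_size : nullptr;
        }
    }
};

// A bag: all pages of one cell size, plus a simple list of those pages.
struct Bag {
    std::size_t cell_size;
    std::vector<Page> pages;

    void* allocate() {
        for (Page& p : pages)
            if (p.free_head) {
                void* cell = p.free_head;
                p.free_head = *static_cast<void**>(cell);  // pop overlay list
                return cell;
            }
        pages.emplace_back(cell_size);    // every page full: add a fresh one
        return allocate();
    }
};

// Index mapping a (rounded) size to its bag; oversized requests bypass it.
class Bibop {
public:
    void* allocate(std::size_t size) {
        if (size > kPageSize) return std::malloc(size);   // "out-of-norm" path
        // Round up so each cell can hold the overlay pointer, aligned.
        std::size_t cs = (std::max(size, sizeof(void*)) + sizeof(void*) - 1)
                         & ~(sizeof(void*) - 1);
        return bags_.try_emplace(cs, Bag{cs}).first->second.allocate();
    }
    // Deallocation (omitted) would push the cell back onto its page's
    // overlay free list; the per-object overhead stays zero either way.
private:
    std::map<std::size_t, Bag> bags_;
};
```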
I would recommend making a customized file system (which might of course contain just one file), based on FUSE. There are a lot of FUSE solutions available to build on - I recommend choosing an unrelated but simple project, so it is easy to learn from.
Which algorithm and data structure to choose depends heavily on your needs. It could be: a map, a list, or a file split into chunks with on-the-fly compression/decompression.
The data structures you propose are good ideas. As you clearly see, there is a trade-off: fragmentation vs. compaction.
On one side - best compaction, highest fragmentation - splay trees and many other kinds of trees.
On the other side - lowest fragmentation, worst compaction - a linked list.
In between there are B-trees and others.
As I understand it, your stated priority is space saving - while taking care of performance.
I would recommend a mixed data structure in order to meet all the requirements.
This gives you a fast response on demand, while "optimising" things as they are used (for example, for each read of 10 MB of data, defragment 1 MB) or in idle moments.
The simplest solution is a free list: keep a linked list of free blocks, reusing the free space itself to store the address of the next block in the list.
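A sketch of that embedded free list, assuming fixed-size blocks and an 8-byte header at offset 0 holding the offset of the first free block (0 can mean "none" because the header itself occupies offset 0); these details are my assumptions:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr std::size_t kBlockSize = 4096;  // assumed fixed block size

// Error handling is omitted for brevity.
uint64_t read_u64(std::FILE* f, uint64_t off) {
    uint64_t v = 0;
    std::fseek(f, (long)off, SEEK_SET);
    std::fread(&v, sizeof v, 1, f);
    return v;
}
void write_u64(std::FILE* f, uint64_t off, uint64_t v) {
    std::fseek(f, (long)off, SEEK_SET);
    std::fwrite(&v, sizeof v, 1, f);
}

// Pop a block off the free list, or grow the file if the list is empty.
uint64_t alloc_block(std::FILE* f) {
    uint64_t head = read_u64(f, 0);
    if (head == 0) {
        std::fseek(f, 0, SEEK_END);
        uint64_t off = (uint64_t)std::ftell(f);
        char zeros[kBlockSize] = {};
        std::fwrite(zeros, 1, kBlockSize, f);   // append one fresh block
        return off;
    }
    write_u64(f, 0, read_u64(f, head));         // head = head->next
    return head;
}

// Push a block onto the free list, reusing its first 8 bytes as the link.
void free_block(std::FILE* f, uint64_t off) {
    write_u64(f, off, read_u64(f, 0));          // block->next = old head
    write_u64(f, 0, off);                       // head = block
}
```

Open the file with std::fopen(path, "rb+") and write the 8-byte zero header once at creation; after that the list lives entirely inside the file. The drawback, as the question anticipates, is that fixed-size blocks trade internal fragmentation for this simplicity.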