Efficient seeking in custom file formats
I've been wondering how seeking is implemented across different file formats, and what a good way would be to construct a file holding a lot of data so that it can be seeked efficiently. Some approaches I've considered: equal-sized packets, which allow quick skipping since you know the layout of every data chunk; and pre-indexing the file whenever it is loaded.
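For illustration, the equal-sized-packet idea reduces seeking to pure arithmetic. A minimal sketch in Python (the record size and layout here are hypothetical, not from any real format):

```python
import io
import struct

RECORD_SIZE = 8  # hypothetical: every record occupies exactly 8 bytes

def seek_record(f, i):
    # With equal-sized records, locating record i needs no scanning:
    # its offset is just i * RECORD_SIZE.
    f.seek(i * RECORD_SIZE)
    return f.read(RECORD_SIZE)

# Usage: three fixed-size records packed back to back.
buf = io.BytesIO(b"".join(struct.pack("<q", n) for n in (10, 20, 30)))
record = seek_record(buf, 2)  # jumps straight to the third record
```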
2 Answers
This entirely depends on the kind of data, and what you're trying to seek to.
If you're trying to seek by record index, then sure: fixed-size records make life easier, but waste space. If you're trying to seek by anything else, keeping an index of key → location works well. If you want to be able to build the file up sequentially, you can put the index at the end but reserve the first four bytes of the file (after the magic number or whatever) to hold the location of the index itself (assuming you can rewrite those first four bytes).
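Here is a sketch of that index-at-the-end layout. The magic number, little-endian field widths, and index encoding are all made up for illustration, not a real format:

```python
import io
import struct

MAGIC = b"DEMO"  # hypothetical magic number

def write_file(buf, records):
    # Write magic, then a 4-byte placeholder for the index location.
    buf.write(MAGIC)
    buf.write(struct.pack("<I", 0))
    index = {}
    for key, payload in records:
        index[key] = buf.tell()
        buf.write(struct.pack("<I", len(payload)))  # length-prefixed record
        buf.write(payload)
    index_offset = buf.tell()
    # Serialize the index: entry count, then (key length, key, location).
    buf.write(struct.pack("<I", len(index)))
    for key, loc in index.items():
        kb = key.encode()
        buf.write(struct.pack("<H", len(kb)))
        buf.write(kb)
        buf.write(struct.pack("<I", loc))
    # Go back and rewrite the four reserved bytes with the index location.
    buf.seek(len(MAGIC))
    buf.write(struct.pack("<I", index_offset))

def read_record(buf, key):
    # Follow the pointer after the magic to the index, then seek directly
    # to the record. (A real reader would cache the index, of course.)
    buf.seek(len(MAGIC))
    (index_offset,) = struct.unpack("<I", buf.read(4))
    buf.seek(index_offset)
    (count,) = struct.unpack("<I", buf.read(4))
    index = {}
    for _ in range(count):
        (klen,) = struct.unpack("<H", buf.read(2))
        k = buf.read(klen).decode()
        (loc,) = struct.unpack("<I", buf.read(4))
        index[k] = loc
    buf.seek(index[key])
    (size,) = struct.unpack("<I", buf.read(4))
    return buf.read(size)
```

The file stays append-only while records are written; only the four reserved bytes are rewritten at the end.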
If you want to be able to perform a sort of binary chop on variable length blocks, then having a reasonably efficient way of detecting the start of a block helps - as does having next/previous pointers, as mentioned by Alexander.
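As a concrete (if simplified) illustration of that binary chop: over sorted, newline-terminated variable-length records, the newline plays the role of the cheap block-start detector. Seek to the midpoint byte, scan forward to the next record boundary, and compare. The function and parameter names here are my own:

```python
import io

def first_geq(f, target, size):
    # Binary-chop a file of sorted, newline-terminated records:
    # returns the first record >= target, or None.
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        if mid > 0:
            f.readline()        # skip the (possibly partial) current record
        line = f.readline()     # first complete record at/after mid
        if not line or line.rstrip(b"\n") >= target:
            hi = mid
        else:
            lo = mid + 1
    f.seek(lo)
    if lo > 0:
        f.readline()            # re-align to the record boundary
    line = f.readline()
    return line.rstrip(b"\n") if line else None
```

Each probe costs one seek plus a scan to the next boundary, so lookup stays O(log n) probes without any fixed-size records or external index.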
Basically it's all about metadata, really - but the right kind of metadata will depend on the kind of data, and the use cases for seeking in the first place.
Well, giving each chunk a size field (in effect, the offset to the next chunk) is common and allows fast skipping of unknown data. Another way would be an index chunk at the beginning of the file, storing a table of all chunks in the file along with their offsets. Programs would simply read the index chunk into memory.
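A sketch of that size-prefixed chunk skipping, loosely in the spirit of formats like PNG or RIFF (the tags and field widths here are made up):

```python
import io
import struct

def write_chunks(buf, chunks):
    # Each chunk: 4-byte tag, 4-byte payload size, payload.
    for tag, payload in chunks:
        buf.write(tag)
        buf.write(struct.pack("<I", len(payload)))
        buf.write(payload)

def find_chunk(buf, wanted_tag):
    # Walk the file, skipping unknown chunks by seeking past
    # their declared payload size.
    buf.seek(0)
    while True:
        header = buf.read(8)
        if len(header) < 8:
            return None  # reached end of file without a match
        tag = header[:4]
        (size,) = struct.unpack("<I", header[4:])
        if tag == wanted_tag:
            return buf.read(size)
        buf.seek(size, 1)  # relative seek over this chunk's payload
```

A reader built this way can skip chunk types it has never heard of, which is what makes such formats forward-extensible.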