How do on-disk pointers work?
Suppose I want to store a complicated data structure (a tree, say) to disk. The links that connect the nodes of my data structure are in-memory pointers, but I can't just write these pointers to disk, because when I read the data structure back the memory locations will have changed.
So what is the right way to store the pointers on disk? Is the answer as simple as (file, offset), or is there something that I'm missing? I can intuit how pointers might be converted to (file, offset) pairs and back again, but are there some subtleties that I should watch out for?
Edit: I should mention that I'm especially interested in how a database would do this internally, for a B-tree. I probably made the question more general than I should have, though I do appreciate the XML-based answers.
Comments (5)
Your intuition about (file, offset) pairs is correct.
An important thing to watch out for when storing data on disk is that disks are slow. For that reason, special data structures have been designed to store "searchable" data on disk. Accessing the nodes of a binary search tree stored on disk through (file, offset) pointers would be orders of magnitude slower than accessing them in memory.
If speed of access is important, you'd want to store things that are expected to be accessed together close together on disk. Two data structures used for this are the B-tree and the B+ tree. Look these up to find out how to use them. Several kinds of applications, databases among them, also use complicated caching algorithms to keep things in memory, so that they do not need to go to disk to retrieve the same data again and again.
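As a rough illustration of how a B-tree-style layout keeps related keys together, here is a minimal sketch in which every node occupies one fixed-size page and child "pointers" are page numbers, so the byte offset is simply `page_no * PAGE_SIZE`. The page size, the record layout, and the names `write_node`, `read_node`, and `search` are assumptions made for this example, not the format of any particular database; insertion and node splitting are left out.

```python
import io
import struct

PAGE_SIZE = 64        # tiny page for illustration; real systems use much larger pages
MAX_KEYS = 4          # hypothetical limit on keys per node for this sketch
NO_PAGE = 0xFFFFFFFF  # sentinel meaning "no child here"

# Assumed layout: key count (uint16), leaf flag (uint8), one pad byte,
# MAX_KEYS keys (int64), MAX_KEYS + 1 child page numbers (uint32).
NODE_FMT = f"<HBx{MAX_KEYS}q{MAX_KEYS + 1}I"
NODE_SIZE = struct.calcsize(NODE_FMT)


def write_node(f, page_no, keys, children):
    """Serialize one node into the page numbered page_no."""
    is_leaf = 0 if children else 1
    padded_keys = list(keys) + [0] * (MAX_KEYS - len(keys))
    padded_children = list(children) + [NO_PAGE] * (MAX_KEYS + 1 - len(children))
    raw = struct.pack(NODE_FMT, len(keys), is_leaf, *padded_keys, *padded_children)
    f.seek(page_no * PAGE_SIZE)            # the page number is the on-disk pointer
    f.write(raw.ljust(PAGE_SIZE, b"\x00"))


def read_node(f, page_no):
    """Read one page and return (keys, child page numbers)."""
    f.seek(page_no * PAGE_SIZE)
    fields = struct.unpack(NODE_FMT, f.read(PAGE_SIZE)[:NODE_SIZE])
    count, is_leaf = fields[0], fields[1]
    keys = list(fields[2:2 + count])
    first_child = 2 + MAX_KEYS
    children = [] if is_leaf else list(fields[first_child:first_child + count + 1])
    return keys, children


def search(f, page_no, key):
    """Return True if key is stored in the subtree rooted at page_no."""
    keys, children = read_node(f, page_no)  # one page read per tree level
    i = 0
    while i < len(keys) and key > keys[i]:
        i += 1
    if i < len(keys) and keys[i] == key:
        return True
    if not children:
        return False
    return search(f, children[i], key)


# An in-memory buffer stands in for a real file opened in binary mode.
db = io.BytesIO()
write_node(db, 1, [1, 5], [])
write_node(db, 2, [12, 17], [])
write_node(db, 0, [10], [1, 2])  # root lives in page 0 and points at pages 1 and 2
found = search(db, 0, 17)
missing = search(db, 0, 7)
```

Because each lookup reads one page per level and a page holds many keys, a wide node needs far fewer disk reads than a binary tree holding the same data.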
If speed of access is not important, then simply "serializing" data on disk in the form of XML as suggested by Aiden and Darren is good enough.
Edit: If you need more details about how databases store data on disk, you'd need to learn more about database theory. I'd suggest reading a good book on databases, so that you understand the requirements that drive the disk format. Note that I am mostly referring to relational databases here, but there are other breeds of databases, which have completely different requirements and hence different disk formats. Starting with relational databases is a good thing to do though, since they are the most commonly used.
In short, a few things that affect relational database disk format are:
Query optimization is an important branch of database theory that deals with minimizing the disk accesses needed to satisfy a query. Hopefully, this will get you started in the right direction.
Any way you like. You could store each node as its own file on top of a filesystem, with references to the other files, or write a filesystem driver that uses block references.
Providing:
You can do it any way you wish. Filesystems are trees that use a disk-based inode system.
You could always use a single file with a header, and use byte offsets inside the file, stored as unsigned ints (or values that map onto ints), to denote the start of each node. You would then put an end-of-record marker at the end of each node.
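A minimal sketch of the single-file idea, under some assumptions of my own: a binary tree is given as nested `(value, left, right)` tuples, the header holds a magic string plus the root's offset, and offset 0 doubles as "no child" because the header itself lives there. The names `save` and `load` and the record layout are hypothetical. Records here are fixed-size, so no end-of-record marker is needed; variable-size records would need one, or a length prefix.

```python
import io
import struct

HEADER = struct.Struct("<4sQ")  # magic bytes, byte offset of the root node
NODE = struct.Struct("<qQQ")    # value, offset of left child, offset of right child
NULL = 0                        # offset 0 is the header, so it can mean "no child"


def _write(f, node):
    """Append node and its subtree to f; return the byte offset of node."""
    if node is None:
        return NULL
    value, left, right = node
    # Children are written first so that their offsets are known
    # by the time the parent record is packed.
    left_off = _write(f, left)
    right_off = _write(f, right)
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    f.write(NODE.pack(value, left_off, right_off))
    return offset


def save(f, tree):
    """Write tree to f, then patch the header with the root's offset."""
    f.seek(0)
    f.write(HEADER.pack(b"TREE", NULL))  # placeholder until the root is written
    root_off = _write(f, tree)
    f.seek(0)
    f.write(HEADER.pack(b"TREE", root_off))


def _read(f, offset):
    """Rebuild the subtree whose root record starts at offset."""
    if offset == NULL:
        return None
    f.seek(offset)
    value, left_off, right_off = NODE.unpack(f.read(NODE.size))
    return (value, _read(f, left_off), _read(f, right_off))


def load(f):
    """Read the header, then follow offsets to rebuild the whole tree."""
    f.seek(0)
    magic, root_off = HEADER.unpack(f.read(HEADER.size))
    if magic != b"TREE":
        raise ValueError("not a tree file")
    return _read(f, root_off)


tree = (8, (3, None, None), (10, None, (14, None, None)))
buf = io.BytesIO()  # stands in for open(path, "w+b")
save(buf, tree)
restored = load(buf)
```

The little-endian `<` prefix in the format strings pins the byte order and removes padding, so the file does not depend on the machine that wrote it.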
You could also use XML files with references to other locations, or a single file and XPath/XPointers.
But this would mean serializing your values into characters even if they are just binary blobs (eww). Your value could instead be the path of a binary chunk just written to a file, such as:
Check out anything from XML encapsulation through to filesystems written in C for a whole gamut of tree implementations.
This XML solution might be bloated, but is simple enough if you don't need speed. Just an example of a high-level approach. Tree storage is an age-old problem, with solutions at all levels.
Trees is trees.
Exactly, storing pointer values would be meaningless.
You should create a textual or binary format that will hold the data in a tree structure.
I suggest reading about the Nested Set Model, which is another way of storing a tree data structure, in this case in a relational database.
For example, this is how your data may be stored:
This is only an example, and using JSON (recommended) or XML may be better and easier.
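To make the Nested Set Model mentioned above a little more concrete, here is a minimal sketch that assigns the left/right numbers with one depth-first walk. The tree shape `(name, [children])` and the function names are assumptions for this example; in a relational database each returned row would become one table row with `lft` and `rgt` columns.

```python
def nested_sets(tree):
    """Return (name, lft, rgt) rows for a tree given as (name, [children])."""
    rows = []
    counter = 1

    def visit(node):
        nonlocal counter
        name, children = node
        lft = counter              # number handed out on the way down
        counter += 1
        for child in children:
            visit(child)
        rows.append((name, lft, counter))  # number handed out on the way up
        counter += 1

    visit(tree)
    return sorted(rows, key=lambda row: row[1])


def descendants(rows, name):
    """A node's descendants are the rows whose interval lies strictly inside its own."""
    lft, rgt = next((l, r) for n, l, r in rows if n == name)
    return [n for n, l, r in rows if lft < l and r < rgt]


tree = ("root", [("a", [("a1", []), ("a2", [])]), ("b", [])])
rows = nested_sets(tree)
below_a = descendants(rows, "a")
```

No pointers are stored at all: the parent/child relationships are implied by which intervals contain which, and that is why the model maps well onto plain table rows.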
Binary or Text is the first question
Historically, applications used complex binary formats for structured data, but the current trend is to define a text-based representation, as this produces more developer- and user-friendly files.
XML was created as a portable way to persist and interchange structured data.
If it were me, I would use the XML-like but less clunky YAML.
If the files are likely to get really large then you could do what OpenOffice does and keep them as text-based markup but written directly into a compressed (I think it's zip for OO) archive.
Most languages already have serialization libraries; I'm sure there is some Boost library for C++. Typically there are multiple serialization interfaces that use different representations.
If you use a library, XML, or YAML, the links will be implicit in the tree-structured representation. If your data has a more general graph, then whether you use text or binary you may have to normalize links. This is the pointer problem you mentioned. One way to resolve it would be to keep temporary maps that are used when reading or writing the file. That is, you just name every link target, say, A1, A2, A3 ... and then use it as a tag at the destination and as a link name (think href=) at the source.
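The temporary-map idea can be sketched roughly as follows. The `Node` class, the JSON record shape, and the function names are assumptions made for this example; the A1, A2, ... names follow the scheme described above. One map (object to name) exists only while writing, and another (name to object) only while reading.

```python
import json


class Node:
    """Hypothetical graph node: a value plus links to other nodes."""

    def __init__(self, value):
        self.value = value
        self.links = []


def dump_graph(root):
    """Serialize every node reachable from root, replacing pointers with names."""
    names = {}    # temporary map: id(node) -> name, needed only while writing
    records = []

    def visit(node):
        if id(node) in names:          # already named: emit a link, not a copy
            return names[id(node)]
        name = f"A{len(names) + 1}"
        names[id(node)] = name         # name it before recursing so cycles terminate
        record = {"name": name, "value": node.value, "links": []}
        records.append(record)
        record["links"] = [visit(target) for target in node.links]
        return name

    visit(root)
    return json.dumps(records)


def load_graph(text):
    """Rebuild the graph: create all nodes first, then resolve names to objects."""
    records = json.loads(text)
    by_name = {r["name"]: Node(r["value"]) for r in records}  # temporary map: name -> node
    for r in records:
        by_name[r["name"]].links = [by_name[n] for n in r["links"]]
    return by_name[records[0]["name"]]


a, b, c = Node("a"), Node("b"), Node("c")
a.links = [b, c]
b.links = [c]
c.links = [a]  # shared target and a cycle, so a plain nested tree would not do
copy = load_graph(dump_graph(a))
```

Reading in two passes (create, then link) is what lets forward references and cycles resolve without any special handling.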
I would not use file offsets as pointers; it just seems too fragile, and it naturally makes sense to use XML or YAML or something else that already exists.
Would it be possible to serialize your in-memory tree? This sounds like the common Java problem of sending an object over the network. Objects have references to other things, but those pointer addresses would be meaningless once outside the program's address space. Could you serialize your tree to an XML or JSON form?
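For a plain tree, serializing to JSON can be as small as the sketch below: nesting in the output replaces the pointers, so no addresses or offsets are stored at all. The `TreeNode` class and the helper names are hypothetical, chosen only for this example.

```python
import json


class TreeNode:
    """Hypothetical in-memory tree node."""

    def __init__(self, key, children=None):
        self.key = key
        self.children = children or []


def to_dict(node):
    """Turn a node and its subtree into nested dicts and lists."""
    return {"key": node.key, "children": [to_dict(c) for c in node.children]}


def from_dict(data):
    """Rebuild TreeNode objects from the nested dict form."""
    return TreeNode(data["key"], [from_dict(c) for c in data["children"]])


original = TreeNode("root", [TreeNode("a"), TreeNode("b", [TreeNode("c")])])
text = json.dumps(to_dict(original))
restored = from_dict(json.loads(text))
```

This only works as-is when every node has exactly one parent; shared nodes or cycles would need named links instead of nesting.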