如何在 Haskell 中将树数据结构保存到二进制文件
我正在尝试使用 Haskell 将一个简单(但相当大)的树结构保存到二进制文件中。结构看起来像这样:
-- For simplicity assume each Node has only 4 childs data Tree = Node [Tree] | Leaf [Int]And here is how I need the data look on disk:
- 每个节点以其子节点的四个 32 位偏移量开始,然后跟随子节点。
- 我不太关心叶子,假设它只是 n 个连续的 32 位数字。
- 出于实用目的,我需要一些节点标签或其他一些附加数据 但现在我也不在乎那么多。
在我看来,Haskeller 在编写二进制文件时的首选是 Data.Binary.Put 库。但我在第 1 个项目符号中遇到了问题。特别是,当我要将节点写入文件时,为了记下子偏移量,我需要知道当前的偏移量和每个子节点的大小。
这不是 Data.Binary.Put 提供的东西,所以我认为这一定是 Monad 变压器的完美应用。但尽管这听起来很酷而且很实用,但到目前为止我还没有成功地采用这种方法。
我问了另外两个问题,我认为这两个问题可以帮助我解决问题 这里和这里。我必须说,每次我都收到非常好的答案,帮助我进一步进步,但不幸的是我仍然无法解决整个问题。
这里是我到目前为止所得到的,它仍然泄漏太多内存而不实用。
我很想拥有使用这种功能方法的解决方案,但也将感激任何其他解决方案。
I'm trying to save a simple (but quite big) Tree structure into a binary file using Haskell. The structure looks something like this:
-- For simplicity assume each Node has only 4 childs data Tree = Node [Tree] | Leaf [Int]
And here is how I need the data look on disk:
- Each node starts with four 32-bit offsets to it's children, then follow the childs.
- I don't care much about the leafs, let's say it's just n consecutive 32-bit numbers.
- For practival purposes I would need some node labels or some other additional data
but right now I don't care about that much neither.
It apears to me that Haskellers first choice when writing binary files is the Data.Binary.Put library. But with that I have a problem in the bullet #1. In particular, when I'm about to write a Node to a file, to write down the child offsets I need to know my current offset and the size of each child.
This is not something that Data.Binary.Put provides so I thought this must be a perfect application of Monad transformers. But even though it sounds cool and functional, so far I have not been successfull with this approach.
I asked two other questions that I thought would help me solve the problem here and here. I must say that each time I received very nice answers that helped me progress further but unfortunatelly I am still unable to solve the problem as a whole.
Here is what I've got so far, it still leaks too much memory to be practical.
I would love to have solution that uses such functional approach, but would be grateful for any other solution as well.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
这是sclv提出的两遍解决方案的实现。
SizeTree 模仿原始 Tree,它不包含数据,但在每个节点存储 Tree 中相应子节点的大小。
我们需要在内存中保存 SizeTree,因此值得使其更加紧凑(例如用 uboxed 单词替换 Ints)。
通过内存中的 SizeTree,可以以流方式序列化原始树。
重要的是要防止 GHC 将两个通道合并为一个通道(在这种情况下,它将把树保存在内存中)。
在这里,它是通过向函数提供树生成器而不是树来完成的。
Here is implementation of two pass solution proposed by sclv.
SizeTree mimics original Tree, it does not contain data but at each node it stores size of corresponding child in Tree.
We need to have SizeTree in memory, so it worth to make it more compact (e.g. replace Ints with uboxed words).
With SizeTree in memory it is possible to serialize original Tree in streaming fashion.
It is important to prevent GHC from merging two passes into one (in which case it will hold tree in memory).
Here it is done by feeding not tree but tree generator to the function.
我会考虑两种基本方法。如果整个序列化结构很容易装入内存,您可以将每个节点序列化为惰性字节串,然后仅使用每个节点的长度来计算距当前位置的偏移量。
另一种选择是,在序列化节点后,返回并使用适当的数据重写偏移字段。如果树很大,这可能是唯一的选择,但我不知道支持此功能的序列化库。这将涉及
IO
的工作以及寻找
正确的位置。There are two basic approaches I would consider. If the entire serialized structure will easily fit into memory, you can serialize each node into a lazy bytestring and just use the lengths for each of them to calculate the offset from the current position.
The other option is, after serializing a node, go back and re-write the offset fields with the appropriate data. This may be the only option if the tree is large, but I don't know of a serialization library that supports this. It would involve working in
IO
andseek
ing to the correct locations.我认为你想要的是一个明确的两遍解决方案。第一个将您的树转换为带有大小注释的树。这个过程会强制树,但实际上可以通过打结来完成,而无需任何单子机制。第二遍是在普通的 Put monad 中,并且考虑到大小注释已经计算出来,应该非常简单。
What I think you want is an explicit two pass solution. The first converts your tree into a size annotated tree. This pass forces the tree, but can be done, in fact, without any monadic machinery at all by tying the knot. The second pass is in the plain old Put monad, and given that the size annotations are already calculated, should be very straightforward.
这是使用 Builder< 的实现/a>,它是“二进制”包的一部分。我还没有正确分析它,但根据“top”,它立即分配 108 MB,然后在其余执行过程中保留该内存。
请注意,我还没有尝试读回数据,因此我的大小和偏移计算中可能存在潜在错误。
Here is an implementation using Builder, which is part of the "binary" package. I haven't profiled it properly, but according to "top" it immediately allocates 108 Mbytes and then hangs on to that for the rest of the execution.
Note that I haven't tried reading the data back, so there may be lurking errors in my size and offset calculations.