如何在 Haskell 中将树数据结构保存到二进制文件

发布于 2024-10-19 15:57:42 字数 1075 浏览 11 评论 0原文

我正在尝试使用 Haskell 将一个简单（但相当大）的树结构保存到二进制文件中。结构看起来像这样：

-- For simplicity assume each Node has only 4 childs
data Tree = Node [Tree] | Leaf [Int]

And here is how I need the data look on disk:

每个节点以其子节点的四个 32 位偏移量开始，然后跟随子节点。
我不太关心叶子，假设它只是 n 个连续的 32 位数字。
出于实用目的，我需要一些节点标签或其他一些附加数据但现在我也不在乎那么多。

在我看来，Haskeller 在编写二进制文件时的首选是 Data.Binary.Put 库。但我在第 1 个项目符号中遇到了问题。特别是，当我要将节点写入文件时，为了记下子偏移量，我需要知道当前的偏移量和每个子节点的大小。

这不是 Data.Binary.Put 提供的东西，所以我认为这一定是 Monad 变压器的完美应用。但尽管这听起来很酷而且很实用，但到目前为止我还没有成功地采用这种方法。

我问了另外两个问题，我认为这两个问题可以帮助我解决问题这里和这里。我必须说，每次我都收到非常好的答案，帮助我进一步进步，但不幸的是我仍然无法解决整个问题。

这里是我到目前为止所得到的，它仍然泄漏太多内存而不实用。

我很想拥有使用这种功能方法的解决方案，但也将感激任何其他解决方案。

原文

I'm trying to save a simple (but quite big) Tree structure into a binary file using Haskell. The structure looks something like this:

-- For simplicity assume each Node has only 4 childs
data Tree = Node [Tree] | Leaf [Int]

And here is how I need the data look on disk:

Each node starts with four 32-bit offsets to it's children, then follow the childs.
I don't care much about the leafs, let's say it's just n consecutive 32-bit numbers.
For practival purposes I would need some node labels or some other additional data
but right now I don't care about that much neither.

It apears to me that Haskellers first choice when writing binary files is the Data.Binary.Put library. But with that I have a problem in the bullet #1. In particular, when I'm about to write a Node to a file, to write down the child offsets I need to know my current offset and the size of each child.

This is not something that Data.Binary.Put provides so I thought this must be a perfect application of Monad transformers. But even though it sounds cool and functional, so far I have not been successfull with this approach.

I asked two other questions that I thought would help me solve the problem here and here. I must say that each time I received very nice answers that helped me progress further but unfortunatelly I am still unable to solve the problem as a whole.

Here is what I've got so far, it still leaks too much memory to be practical.

I would love to have solution that uses such functional approach, but would be grateful for any other solution as well.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

云雾 2024-10-26 15:57:42

这是sclv提出的两遍解决方案的实现。

import qualified Data.ByteString.Lazy as L
import Data.Binary.Put
import Data.Word
import Data.List (foldl')

data Tree = Node [Tree] | Leaf [Word32] deriving Show

makeTree 0 = Leaf $ replicate 100 0xdeadbeef
makeTree n = Node $ replicate 4 $ makeTree $ n-1

SizeTree 模仿原始 Tree，它不包含数据，但在每个节点存储 Tree 中相应子节点的大小。
我们需要在内存中保存 SizeTree，因此值得使其更加紧凑（例如用 uboxed 单词替换 Ints）。

data SizeTree
  = SNode {sz :: Int, chld :: [SizeTree]}
  | SLeaf {sz :: Int}
  deriving Show

通过内存中的 SizeTree，可以以流方式序列化原始树。

putTree :: Tree -> SizeTree -> Put
putTree (Node xs) (SNode _ ys) = do
  putWord8 $ fromIntegral $ length xs          -- number of children
  mapM_ (putWord32be . fromIntegral . sz) ys   -- sizes of children
  sequence_ [putTree x y | (x,y) <- zip xs ys] -- children data
putTree (Leaf xs) _ = do
  putWord8 0                                   -- zero means 'leaf'
  putWord32be $ fromIntegral $ length xs       -- data length
  mapM_ putWord32be xs                         -- leaf data


mkSizeTree :: Tree -> SizeTree
mkSizeTree (Leaf xs) = SLeaf (1 + 4 + 4 * length xs)
mkSizeTree (Node xs) = SNode (1 + 4 * length xs + sum' (map sz ys)) ys
  where
    ys = map mkSizeTree xs
    sum' = foldl' (+) 0

重要的是要防止 GHC 将两个通道合并为一个通道（在这种情况下，它将把树保存在内存中）。
在这里，它是通过向函数提供树生成器而不是树来完成的。

serialize mkTree size = runPut $ putTree (mkTree size) treeSize
  where
    treeSize = mkSizeTree $ mkTree size

main = L.writeFile "dump.bin" $ serialize makeTree 10

Here is implementation of two pass solution proposed by sclv.

import qualified Data.ByteString.Lazy as L
import Data.Binary.Put
import Data.Word
import Data.List (foldl')

data Tree = Node [Tree] | Leaf [Word32] deriving Show

makeTree 0 = Leaf $ replicate 100 0xdeadbeef
makeTree n = Node $ replicate 4 $ makeTree $ n-1

SizeTree mimics original Tree, it does not contain data but at each node it stores size of corresponding child in Tree.
We need to have SizeTree in memory, so it worth to make it more compact (e.g. replace Ints with uboxed words).

data SizeTree
  = SNode {sz :: Int, chld :: [SizeTree]}
  | SLeaf {sz :: Int}
  deriving Show

With SizeTree in memory it is possible to serialize original Tree in streaming fashion.

putTree :: Tree -> SizeTree -> Put
putTree (Node xs) (SNode _ ys) = do
  putWord8 $ fromIntegral $ length xs          -- number of children
  mapM_ (putWord32be . fromIntegral . sz) ys   -- sizes of children
  sequence_ [putTree x y | (x,y) <- zip xs ys] -- children data
putTree (Leaf xs) _ = do
  putWord8 0                                   -- zero means 'leaf'
  putWord32be $ fromIntegral $ length xs       -- data length
  mapM_ putWord32be xs                         -- leaf data


mkSizeTree :: Tree -> SizeTree
mkSizeTree (Leaf xs) = SLeaf (1 + 4 + 4 * length xs)
mkSizeTree (Node xs) = SNode (1 + 4 * length xs + sum' (map sz ys)) ys
  where
    ys = map mkSizeTree xs
    sum' = foldl' (+) 0

It is important to prevent GHC from merging two passes into one (in which case it will hold tree in memory).
Here it is done by feeding not tree but tree generator to the function.

serialize mkTree size = runPut $ putTree (mkTree size) treeSize
  where
    treeSize = mkSizeTree $ mkTree size

main = L.writeFile "dump.bin" $ serialize makeTree 10

回复收藏 0 原文

对不⑦ 2024-10-26 15:57:42

我会考虑两种基本方法。如果整个序列化结构很容易装入内存，您可以将每个节点序列化为惰性字节串，然后仅使用每个节点的长度来计算距当前位置的偏移量。

serializeTree (Leaf nums)  = runPut (mapM_ putInt32 nums)
serializeTree (Node subtrees) = mconcat $ header : childBs
 where
  childBs = map serializeTree subtrees
  offsets = scanl (\acc bs -> acc+L.length bs) (fromIntegral $ 2*length subtrees) childBs
  header = runPut (mapM_ putInt32 $ init offsets)

另一种选择是，在序列化节点后，返回并使用适当的数据重写偏移字段。如果树很大，这可能是唯一的选择，但我不知道支持此功能的序列化库。这将涉及IO 的工作以及寻找正确的位置。

There are two basic approaches I would consider. If the entire serialized structure will easily fit into memory, you can serialize each node into a lazy bytestring and just use the lengths for each of them to calculate the offset from the current position.

serializeTree (Leaf nums)  = runPut (mapM_ putInt32 nums)
serializeTree (Node subtrees) = mconcat $ header : childBs
 where
  childBs = map serializeTree subtrees
  offsets = scanl (\acc bs -> acc+L.length bs) (fromIntegral $ 2*length subtrees) childBs
  header = runPut (mapM_ putInt32 $ init offsets)

The other option is, after serializing a node, go back and re-write the offset fields with the appropriate data. This may be the only option if the tree is large, but I don't know of a serialization library that supports this. It would involve working in IO and seeking to the correct locations.

回复收藏 0 原文

深巷少女 2024-10-26 15:57:42

我认为你想要的是一个明确的两遍解决方案。第一个将您的树转换为带有大小注释的树。这个过程会强制树，但实际上可以通过打结来完成，而无需任何单子机制。第二遍是在普通的 Put monad 中，并且考虑到大小注释已经计算出来，应该非常简单。

回复收藏 0 原文

心作怪 2024-10-26 15:57:42

这是使用 Builder< 的实现/a>，它是“二进制”包的一部分。我还没有正确分析它，但根据“top”，它立即分配 108 MB，然后在其余执行过程中保留该内存。

请注意，我还没有尝试读回数据，因此我的大小和偏移计算中可能存在潜在错误。

-- Paste this into TreeBinary.hs, and compile with
--    ghc -O2 --make TreeBinary.hs -o TreeBinary

module Main where


import qualified Data.ByteString.Lazy as BL
import qualified Data.Binary.Builder as B

import Data.List (init)
import Data.Monoid
import Data.Word


-- -------------------------------------------------------------------
-- Test data.

data Tree = Node [Tree] | Leaf [Word32] deriving Show

-- Approximate size in memory (ignoring laziness) I think is:
-- 101 * 4^9 * sizeof(Int) + 1/3 * 4^9 * sizeof(Node)

-- This version uses [Word32] instead of [Int] to avoid having to write
-- a builder for Int.  This is an example of lazy programming instead
-- of lazy evaluation. 

makeTree :: Tree
makeTree = makeTree1 9
  where makeTree1 0 = Leaf [0..100]
        makeTree1 n = Node [ makeTree1 $ n - 1
                           , makeTree1 $ n - 1
                           , makeTree1 $ n - 1
                           , makeTree1 $ n - 1 ]

-- --------------------------------------------------------------------
-- The actual serialisation code.


-- | Given a tree, return a builder for it and its estimated length in bytes.
serialiseTree :: Tree -> (B.Builder, Word32)
serialiseTree (Leaf ns) = (mconcat (B.singleton 2 : map B.putWord32be ns), fromIntegral $ 4 * length ns + 1)
serialiseTree (Node ts) = (mconcat (B.singleton 1 : map B.putWord32be offsets ++ branches), 
                           baseLength + sum subLengths)
   where
      (branches, subLengths) = unzip $ map serialiseTree ts
      baseLength = fromIntegral $ 1 + 4 * length ts
      offsets = init $ scanl (+) baseLength subLengths


main = do
   putStrLn $ "Length = " ++ show (snd $ serialiseTree makeTree)
   BL.writeFile "test.bin" $ B.toLazyByteString $ fst $ serialiseTree makeTree

Here is an implementation using Builder, which is part of the "binary" package. I haven't profiled it properly, but according to "top" it immediately allocates 108 Mbytes and then hangs on to that for the rest of the execution.

Note that I haven't tried reading the data back, so there may be lurking errors in my size and offset calculations.

-- Paste this into TreeBinary.hs, and compile with
--    ghc -O2 --make TreeBinary.hs -o TreeBinary

module Main where


import qualified Data.ByteString.Lazy as BL
import qualified Data.Binary.Builder as B

import Data.List (init)
import Data.Monoid
import Data.Word


-- -------------------------------------------------------------------
-- Test data.

data Tree = Node [Tree] | Leaf [Word32] deriving Show

-- Approximate size in memory (ignoring laziness) I think is:
-- 101 * 4^9 * sizeof(Int) + 1/3 * 4^9 * sizeof(Node)

-- This version uses [Word32] instead of [Int] to avoid having to write
-- a builder for Int.  This is an example of lazy programming instead
-- of lazy evaluation. 

makeTree :: Tree
makeTree = makeTree1 9
  where makeTree1 0 = Leaf [0..100]
        makeTree1 n = Node [ makeTree1 $ n - 1
                           , makeTree1 $ n - 1
                           , makeTree1 $ n - 1
                           , makeTree1 $ n - 1 ]

-- --------------------------------------------------------------------
-- The actual serialisation code.


-- | Given a tree, return a builder for it and its estimated length in bytes.
serialiseTree :: Tree -> (B.Builder, Word32)
serialiseTree (Leaf ns) = (mconcat (B.singleton 2 : map B.putWord32be ns), fromIntegral $ 4 * length ns + 1)
serialiseTree (Node ts) = (mconcat (B.singleton 1 : map B.putWord32be offsets ++ branches), 
                           baseLength + sum subLengths)
   where
      (branches, subLengths) = unzip $ map serialiseTree ts
      baseLength = fromIntegral $ 1 + 4 * length ts
      offsets = init $ scanl (+) baseLength subLengths


main = do
   putStrLn $ "Length = " ++ show (snd $ serialiseTree makeTree)
   BL.writeFile "test.bin" $ B.toLazyByteString $ fst $ serialiseTree makeTree

回复收藏 0 原文

~没有更多了~