为什么包装 Data.Binary.Put monad 会导致内存泄漏？（第二部分）

发布于 2024-10-17 20:30:58 字数 2201 浏览 7 评论 0原文

正如我的上一个问题，我正在尝试将 Data.Binary.Put monad 包装到另一个 monad 中，以便稍后我可以问它诸如“它将写入多少字节”或“文件中的当前位置是什么”之类的问题。

之前，我认为理解为什么它在使用简单的（IdentityT？）包装器时会泄漏内存将引导我解决我的问题。但即使你们帮我解决了这个简单的包装器的问题，用 StateT 或 WriterT 等有用的东西包装它仍然会消耗太多内存（并且通常会崩溃）。

例如，这是我尝试包装它的一种方法，它会泄漏大输入的内存：

type Out = StateT Integer P.PutM ()

writeToFile :: String -> Out -> IO ()
writeToFile path out = BL.writeFile path $ P.runPut $ do runStateT out 0
                                                         return ()

这里是演示该问题的更完整的代码示例。

我想知道的是：

程序内部发生了什么导致内存泄漏？
我能做什么来修复它？

对于我的第二个问题，我认为我应该更详细地解释我想要数据在磁盘上查看的内容：它基本上是一个树结构，其中树的每个节点都表示为其子节点的偏移表（加上一些附加数据）。因此，要计算第 n 个子节点在偏移表中的偏移量，我需要知道 0 到 n-1 子节点的大小加上当前偏移量（为了简化问题，假设每个节点都有固定数量的子节点）。

感谢您的关注。

更新：感谢 nominolo，我现在可以创建一个包裹 Data.Binary.Put 的 monad，跟踪当前偏移量并且几乎不使用内存。这是通过放弃使用 StateT 转换器转而使用使用 Continuations 的不同状态线程机制来完成的。

像这样：

type Offset = Int

newtype MyPut a = MyPut
  { unS :: forall r . (Offset -> a -> P.PutM r) -> Offset -> P.PutM r }

instance Monad MyPut where
  return a = MyPut $ \f s -> f s a
  ma >>= f = MyPut $ \fb s -> unS ma (\s' a -> unS (f a) fb s') s

writeToFile :: String -> MyPut () -> IO ()
writeToFile path put =
  BL.writeFile path $ P.runPut $ peal put >> return ()
  where peal myput = unS myput (\o -> return) 0

getCurrentOffset :: MyPut Int
getCurrentOffset = MyPut $ \f o -> f o o

lift' n ma = MyPut $ \f s -> ma >>= f (s+n)

但是，我在跟踪 MyPut 将在磁盘上写入多少字节方面仍然存在问题。特别是，我需要一个具有如下签名的函数：

getSize :: MyPut a -> MyPut Int

getSize :: MyPut a -> Int

我的方法是将 MyPut monad 包装在 WriterT 变压器中（类似于 this）。但这又开始消耗太多内存。正如 sclv 在 nominolos 答案下的评论中提到的那样，WriterT 以某种方式抵消了延续的影响。他还提到，应该可以直接从我已有的 MyPut monad 获取大小，但我这样做的所有尝试都以不可编译的代码或无限循环结束:-|。

有人可以进一步提供帮助吗？

原文

As in my previous question, I'm trying to wrap the Data.Binary.Put monad into another monad so that later I can ask it questions like "how many bytes it's going to write" or "what is the current position in file".

Before, I thought that understanding why it leaks memory while using a trivial (IdentityT?) wrapper would lead me to solving my problem. But even though you guys have helped me resolve the problem with the trivial wrapper, wrapping it with something usefull like StateT or WriterT still consumes too much memory (and usually crashes).

For example, this is one way I'm trying to wrap it and which leaks memory for big input:

type Out = StateT Integer P.PutM ()

writeToFile :: String -> Out -> IO ()
writeToFile path out = BL.writeFile path $ P.runPut $ do runStateT out 0
                                                         return ()

Here is a more complete code sample that demonstrates the problem.

What I would like to know is this:

What is happending inside the program that causes the memory leak?
What can I do to fix it?

For my second question I think I should explain in more details what I intend the data to look on disk: It is basically a tree structure where each node of the tree is represented as an offset table to it's children (plus some additional data). So to calculate offset of n-th children into the offset table I need to know the sizes of children 0 to n-1 plus the current offset (to simplify things, let's say each node has fixed number of childs).

Thanks for looking.

UPDATE:
Thanks to nominolo I can now create a monad that wraps around the Data.Binary.Put, tracks current offset and uses almost no memory. This is done by dropping the use of StateT transformer in favor of a different state threading mechanism that uses Continuations.

Like this:

type Offset = Int

newtype MyPut a = MyPut
  { unS :: forall r . (Offset -> a -> P.PutM r) -> Offset -> P.PutM r }

instance Monad MyPut where
  return a = MyPut $ \f s -> f s a
  ma >>= f = MyPut $ \fb s -> unS ma (\s' a -> unS (f a) fb s') s

writeToFile :: String -> MyPut () -> IO ()
writeToFile path put =
  BL.writeFile path $ P.runPut $ peal put >> return ()
  where peal myput = unS myput (\o -> return) 0

getCurrentOffset :: MyPut Int
getCurrentOffset = MyPut $ \f o -> f o o

lift' n ma = MyPut $ \f s -> ma >>= f (s+n)

However I still have a problem with tracking how many bytes is MyPut going to write on disk. In particular, I need to have a function with signature like this:

getSize :: MyPut a -> MyPut Int

getSize :: MyPut a -> Int

My aproach was to wrap the MyPut monad inside WriterT transformer (something like this). But that started to consume too much memory again. As sclv mentions in comments under nominolos answer, WriterT somehow cancels out the effect of continuations. He also mentions that getting the size should be possible directly from the MyPut monad that I already have, but all my attempts to do so ended in non compilable code or an infinite loop :-|.

Could someone please help further?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

赠意 2024-10-24 20:30:59

我开始研究这个并意识到更大的问题是什么——你的算法非常复杂。您不是计算每个子树的大小一次，而是每次调用 getSize 时计算一次。然后递归调用 getSize。对于每个叶节点，每次在其父节点上调用 getSize 时，都会调用一次 getSize。并且每个父级上的 getSize 都会为自己调用一次 + 每次在任何一个父级上调用 getSize 时都会调用一次 getSize 。所以 getsize 至少在树的深度上被几何地调用。您需要缓存大小以获得类似于合理运行时间的东西。

也就是说，这是核心功能的一个版本，它似乎可以正常运行而没有泄漏，尽管由于上述原因它确实在爬行：

type MyPut = S (Offset,Size) P.PutM

peal_1 :: (Monad m, Num t, Num t1) => S (t, t1) m a -> m a
peal_1 put = unS put (\o -> return) (0,0)

writeToFile :: String -> MyPut () -> IO ()
writeToFile path put =
  BL.writeFile path $ P.runPut $ (peal_1 put) >> return ()

getSize :: MyPut a -> MyPut Int
getSize x = S $ \f os -> unS (x >> getCurrentSize) f os

getCurrentOffset :: MyPut Int
getCurrentOffset = S $ \f os -> f os (fst os)

getCurrentSize :: MyPut Int
getCurrentSize = S $ \f os -> f os (snd os)

我还不得不说我不确定您的逻辑总体上是否正确。我的代码在修复泄漏的同时保留了当前的行为。我通过在缩减的数据集上运行它和您的代码并生成逐位相同的文件来对此进行测试。

但是对于你的大测试数据，这段代码在我杀死它之前写了6.5G（提供的代码在那之前就耗尽了堆）。我怀疑但尚未测试 put monad 中的底层调用对于每次调用 getSize 都会运行一次，即使 getSize 的结果被丢弃。

我提出的正确解决方案已发布作为您其他问题的答案：如何在 Haskell 中将树数据结构保存到二进制文件

I started playing with this and realized what the bigger problem is -- your algorithm has terrible complexity. Rather than computing the size of each child tree once, you compute it once for each time you call getSize. And you call getSize recursively. For each leaf node, getSize is called once for each time getSize is called on its parent. And getSize is called on each parent once for itself + once for each time getSize is called on any of its parents. So getsize is called at least geometrically in the depth of the tree. You need to cache the sizes to get something resembling a reasonable runtime.

That said, here's a version of the core functions that appears to run properly without a leak, although it's really crawling along for the reasons stated above:

type MyPut = S (Offset,Size) P.PutM

peal_1 :: (Monad m, Num t, Num t1) => S (t, t1) m a -> m a
peal_1 put = unS put (\o -> return) (0,0)

writeToFile :: String -> MyPut () -> IO ()
writeToFile path put =
  BL.writeFile path $ P.runPut $ (peal_1 put) >> return ()

getSize :: MyPut a -> MyPut Int
getSize x = S $ \f os -> unS (x >> getCurrentSize) f os

getCurrentOffset :: MyPut Int
getCurrentOffset = S $ \f os -> f os (fst os)

getCurrentSize :: MyPut Int
getCurrentSize = S $ \f os -> f os (snd os)

I also have to say I'm not sure if your logic is correct in general. My code preserves the current behavior while fixing the leak. I tested this by running it and your code on a cut-down data set and producing files that are bit-for-bit identical.

But for your large test data, this code wrote 6.5G before I killed it (the provided code exhausted heap well before then). I suspect but have not tested that the underlying calls in the put monad are getting run once for each call to getSize, even though the result of getSize is getting thrown away.

My proposed proper solution is posted as an answer to your other question: How do you save a tree data structure to binary file in Haskell

回复收藏 0 原文

枯叶蝶 2024-10-24 20:30:58

看起来 monad 转换器太懒了。您可以通过运行程序来创建堆配置文件（无需专门构建）：

$ ./myprog +RTS -hT
$ hp2ps myprog.hp
$ open hp2ps.ps    # Or whichever viewer you have

在这种情况下，它不是特别有用，因为它只显示大量 PAP，FUN_1_0 和 FUN_2_0。这意味着堆由许多部分应用的函数以及一个参数和两个参数的函数组成。这通常意味着某些事情没有得到充分的评估。 Monad 转换器因此而有些臭名昭著。

解决方法是使用更严格的 monad 转换器，使用连续传递样式。（他需要 {-# LANGUAGE Rank2Types #-}。

newtype MyStateT s m a =
  MyStateT { unMyStateT :: forall r. (s -> a -> m r) -> s -> m r }

延续传递风格意味着我们不是直接返回结果，而是使用我们的结果调用另一个函数，延续，实例定义看起来有点有趣，请阅读上面的链接（维基百科）。

instance Monad m => Monad (MyStateT s m) where
  return x = MyStateT (\k s -> k s x)
  MyStateT f >>= kk = MyStateT (\k s ->
    f (\s' a -> unMyStateT (kk a) k s') s)

runMyStateT :: Monad m => MyStateT s m a -> s -> m (a, s)
runMyStateT (MyStateT f) s0 = f (\s a -> return (a, s)) s0

instance MonadTrans (MyStateT s) where
  lift act = MyStateT (\k s -> do a <- act; k s a)

type Out = MyStateT Integer P.PutM ()

在这种情况下，位）：

$ ./so1 +RTS -s 
begin
end
   8,001,343,308 bytes allocated in the heap
     877,696,096 bytes copied during GC
          46,628 bytes maximum residency (861 sample(s))
          33,196 bytes maximum slop
            2 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 14345 collections,     0 parallel,  3.32s,  3.38s elapsed
Generation 1:   861 collections,     0 parallel,  0.08s,  0.08s elapsed

使用如此严格的转换器的缺点是您无法再定义 MonadFix 实例，并且某些惰性技巧不再起作用。

It looks like the monad transformer is too lazy. You can create a heap profile (without having to build it specially) by running the program with:

$ ./myprog +RTS -hT
$ hp2ps myprog.hp
$ open hp2ps.ps    # Or whichever viewer you have

In this case it's not particularly helpful, because it only shows lots of PAPs, FUN_1_0s and FUN_2_0s. This means the heap is made up of lots of partially applied functions, and functions of one argument and two arguments. This usually means that something is not evaluated enough. Monad transformers are somewhat notorious for this.

The workaround is to use a more strict monad transformers using continuation passing style. (his requires {-# LANGUAGE Rank2Types #-}.

newtype MyStateT s m a =
  MyStateT { unMyStateT :: forall r. (s -> a -> m r) -> s -> m r }

Continuation passing style means that instead of returning a result directly, we call another function, the continuation, with our result, in this case s and a. The instance definitions look a bit funny. To understand it read the link above (Wikipedia).

instance Monad m => Monad (MyStateT s m) where
  return x = MyStateT (\k s -> k s x)
  MyStateT f >>= kk = MyStateT (\k s ->
    f (\s' a -> unMyStateT (kk a) k s') s)

runMyStateT :: Monad m => MyStateT s m a -> s -> m (a, s)
runMyStateT (MyStateT f) s0 = f (\s a -> return (a, s)) s0

instance MonadTrans (MyStateT s) where
  lift act = MyStateT (\k s -> do a <- act; k s a)

type Out = MyStateT Integer P.PutM ()

Running it now gives constant space (the "maximum residency" bit):

$ ./so1 +RTS -s 
begin
end
   8,001,343,308 bytes allocated in the heap
     877,696,096 bytes copied during GC
          46,628 bytes maximum residency (861 sample(s))
          33,196 bytes maximum slop
            2 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 14345 collections,     0 parallel,  3.32s,  3.38s elapsed
Generation 1:   861 collections,     0 parallel,  0.08s,  0.08s elapsed

The downside of using such strict transformers is that you can no longer define MonadFix instances and certain laziness tricks no longer work.

回复收藏 0 原文

~没有更多了~