How to force Haskell not to store the whole bytestring?
I'm writing a relatively small application in Haskell for academic purposes. I'm implementing Huffman compression, based on this code: http://www.haskell.org/haskellwiki/Toy_compression_implementations
My variant of this code is here: https://github.com/kravitz/har/blob/a5d221f227c27fd1c5587217a29a169a377521a6/huffman.hs and it uses lazy bytestrings. When I implemented RLE compression everything was smooth, because it processes the input stream in a single pass. But Huffman processes it twice, and as a result I end up with a fully evaluated bytestring stored in memory, which is bad for big files (and even for relatively small files it allocates too much heap space). This is not just my suspicion: profiling shows that most of the heap is eaten by bytestring allocations.
I also serialize the stream length into the file, which may likewise force the full bytestring into memory. Is there any simple way to ask GHC to kindly re-evaluate the stream several times?
2 Answers
Instead of passing a bytestring to the encoder, you can pass something that computes a bytestring, then explicitly recompute the value each time you need it.
The parameter to compress should actually compute the value. Simply wrapping a value with return won't work. Also, each call to makeInput must actually have its result evaluated, else there will remain a lazy, un-evaluated copy of the input in memory when the input is recomputed. The usual approach, as barsoap said, is to just compress one block at a time.
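A minimal sketch of that idea, assuming a hypothetical encodeWith as a stand-in for the real Huffman encoder: compress takes an action that produces the input (here L.readFile, with placeholder file names) and runs it once per pass, forcing the frequency table before the second pass begins.

    import Control.Exception (evaluate)
    import qualified Data.ByteString.Lazy as L
    import qualified Data.Map.Strict as M
    import Data.Word (Word8)

    -- Stand-in for the real Huffman encoder; only the two-pass
    -- structure matters for this sketch.
    encodeWith :: M.Map Word8 Int -> L.ByteString -> L.ByteString
    encodeWith _ = id

    -- Frequency pass as a strict left fold, so input chunks can be
    -- garbage collected as soon as they have been counted.
    countFrequencies :: L.ByteString -> M.Map Word8 Int
    countFrequencies = L.foldl' (\m b -> M.insertWith (+) b 1 m) M.empty

    -- Take an action that produces the input and run it once per pass,
    -- so the second pass reads a fresh stream instead of a copy
    -- retained from the first.
    compress :: IO L.ByteString -> IO L.ByteString
    compress makeInput = do
        input1 <- makeInput
        freqs  <- evaluate (countFrequencies input1)  -- force pass 1 so
                                                      -- input1 can be dropped
        input2 <- makeInput                           -- pass 2: a fresh stream
        return (encodeWith freqs input2)

    main :: IO ()
    main = compress (L.readFile "input.dat") >>= L.writeFile "output.har"

Note the evaluate: if freqs were left as an unevaluated thunk, it would keep input1 alive into the second pass, which is exactly the pitfall described above.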
The usual approach when (Huffman-)compressing, since one can't get around processing the input twice (once to collect the probability distribution and once to do the actual compressing), is to chunk the input into blocks and compress each block separately. While that still eats memory, it eats at most a constant amount.
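A minimal sketch of the block-wise approach, with a hypothetical compressBlock standing in for the two per-block Huffman passes:

    import qualified Data.ByteString.Lazy as L

    -- Stand-in for a real per-block round trip (count frequencies,
    -- then encode); the chunking is the point of the sketch.
    compressBlock :: L.ByteString -> L.ByteString
    compressBlock = id

    -- Split a lazy bytestring into blocks of at most n bytes.
    chunksOf :: Int -> L.ByteString -> [L.ByteString]
    chunksOf n bs
        | L.null bs = []
        | otherwise = let (block, rest) = L.splitAt (fromIntegral n) bs
                      in block : chunksOf n rest

    -- Compress block by block: only the block currently being processed
    -- needs to be resident, so memory use is bounded by the block size
    -- (here 1 MiB) rather than the file size.
    compressFile :: FilePath -> FilePath -> IO ()
    compressFile inPath outPath = do
        contents <- L.readFile inPath
        let blocks = chunksOf (1024 * 1024) contents
        L.writeFile outPath (L.concat (map compressBlock blocks))

With a real per-block encoder you would also emit each block's frequency table (or the Huffman tree itself) alongside the encoded data, so the decompressor can rebuild the tree per block.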
That said, you might want to have a look at bytestring-mmap, though that won't work with standard input, sockets, and other file descriptors that aren't backed by a file system which supports mmap.
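For illustration, here is what reading via bytestring-mmap looks like, assuming the package's System.IO.Posix.MMap.unsafeMMapFile and a placeholder file name:

    import qualified Data.ByteString as B
    import System.IO.Posix.MMap (unsafeMMapFile)  -- from bytestring-mmap

    main :: IO ()
    main = do
        -- The file's pages are mapped, not copied: traversing the
        -- resulting strict ByteString does not pull the whole file onto
        -- the Haskell heap; the OS pages it in and out as needed.
        bs <- unsafeMMapFile "input.dat"  -- placeholder file name
        print (B.length bs)               -- e.g. a first pass over the data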
You can also re-read the bytestring from the file (again, provided you're not receiving it from anything pipe-like) after collecting the probability distribution, but that will still make your code bail out on, say, 1 TB files.