霍夫曼压缩算法
我已经使用霍夫曼算法实现了文件压缩,但我遇到的问题是,要启用压缩文件的解压缩,所使用的编码树或代码本身也应该写入文件。 问题是:我该怎么做? 在压缩文件的开头编写编码树的最佳方法是什么?
I've implemented file compression using huffman's algorithm, but the problem I have is that to enable decompression of the compressed file, the coding tree used, or the codes itself should be written to the file too. The question is: how do i do that? What is the best way to write the coding tree at the beggining of the compressed file?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
基本压缩库 (BCL) 中有一个非常标准的霍夫曼编码实现,包括一个递归函数,该函数编写树出到一个文件。 看看霍夫曼。C. 它只是按顺序写出叶子,以便解码器可以重建同一棵树。
BCL 也很好,因为其中还有一些其他非常简单的压缩算法片段。 如果您需要推出自己的算法,这非常方便。
There's a pretty standard implementation of Huffman Coding in the Basic Compression Library (BCL), including a recursive function that writes the tree out to a file. Look at huffman.C. It just writes out the leaves in order so the decoder can reconstruct the same tree.
BCL is also nice because there are some other pretty straightforward compression algorithm pieces in there, too. It's quite handy if you need to roll your own algorithm.
首先,您是否考虑过使用标准压缩流(例如.net中的GZipStream)?
关于如何/在哪里写入数据,您可以使用 Seek 操作 Streams 位置(甚至可以通过这种方式保留空间)。 如果您提前知道树的大小,则可以在该位置之后开始编写。 但是您可能希望将编码树放置在实际数据之后,并确保您知道它从哪里开始。 即前面预留一点空间,写压缩数据,记录位置,写树,到前面把位置写出来。
First off, have you considered using a standard compression Stream (like GZipStream in .net) ?
About how/where to write your data, you can manipulate a Streams position with Seek (even reserve space that way). If you know the size of the Tree ahead of time you can start writing after that position. But you may want to position the coding tree after the actual data, and just make sure you know where that starts. Ie, reserve a little space in front, write the compressed data, record the position, write the tree, go to the front and write out the position.
假设您压缩 8 位符号(即字节)并且算法是非自适应的,最简单的方法是不存储树,而是存储值的分布。 例如,通过存储找到字节 0 的频率、字节 1 的频率、...、字节 255 的频率。然后,在读回文件时,您可以重新组装树。 这是最简单的解决方案,但需要最多的存储空间(例如,要覆盖大文件,每个值需要 4 个字节,即 1kb)。
您可以通过不准确存储每个字节在文件中找到的频率来优化此功能,而是将值标准化为 0..255(0 = 找到最少,...),在这种情况下,您只需要节省256字节。 根据这些值重新组装树将产生相同的树。(这不会像 Edmund 所指出的那样起作用,并且有问题 759707 - 请参阅此处以获取更多链接和问题的答案)PS:正如 Henk 所说,使用eek()可以让您在以下位置保留空间稍后存储值的文件的开头。
Assuming you compress on 8-bit symbols (i.e. bytes) and the algorithm is non-adaptive, the simplest way would be to store not the tree but the distribution of the values. For example by storing how often you found byte 0, how often byte 1, ..., how often byte 255. Then when reading back the file you can re-assemble the tree. This is the simplest solution, but requires the most storage space (e.g. to cover large files, you would need 4 bytes per value, i.e. 1kb).
You could optimize this by not storing exactly how often each byte was found in the file, but instead normalizing the values to 0..255 (0 = found least, ...), in which case you would only need to save 256 bytes. Re-assembling of the tree based on these values would result in the same tree.(This is not going to work as pointed out by Edmund and in question 759707 - see there for further links and answers to your question)P.S.: And as Henk said, using seek() allows you to keep space at the beginning of the file to store the values in later.
大多数实现都使用规范的霍夫曼编码。 您只需以紧凑的方式存储符号长度即可。 实现:shcodec。
另一种方法是使用半静态霍夫曼编码(定期重新缩放),那么您不必存储任何树。
Most implementations are using canonical huffman encoding. You have only to store the symbol lengths in a compact way. Hier an implementation: shcodec.
Another way is using a semi-static huffman encoding (periodic rescale), then you have not to store any tree.
不要将代码树写入文件,而是写入找到每个字符的频率,以便解压程序可以生成相同的树。
Instead of writing the code tree to the file, write how often each character was found, so the decompression program can generate the same tree.
最简单的解决方案是按预先顺序解析压缩树并将 256 个值写入文件的标头中。
The most naive solution would be to parse the compression tree in pre-order and write the 256 values in the header of your file.
由于哈夫曼树中的每个节点要么是具有两个子节点的分支,要么是叶子,因此您可以使用单个位来明确表示每个节点。 对于叶子,紧随其后的是该节点的 8 位。
例如,对于这棵树:
您可以存储 001[B]01[C]1[D]1[A]
(事实证明,这正是之前发布的 huffman.c 示例中发生的情况,但不是上面描述的方式)。
Since every node in a huffman tree is either a branch with two children, or a leaf, you can use a single bit to represent each node unambiguously. For a leaf, follow immediately with the 8 bits for that node.
e.g. for this tree:
You could store 001[B]01[C]1[D]1[A]
(Turns out this is exactly what happens in the huffman.c example posted earlier, but not how it was described above).
最好发送字符的频率并在接收端构建树。 对于固定的字母表,该数据的大小将是恒定的。 我想这必须是可序列化的并放入文件中。 发送树取决于它的实现,对于我所尝试的,基于数组的方法会导致更多的内存未用于树,因为大多数时候树可能不是平衡树。 如果树是平衡的,那么数组表示将生成最佳选择。
哈里桑卡·克里希纳·斯瓦米
it is better to send the frequencies of characters and build the tree at the receiving end. This data will be of constant size for a fixed alphabet. I guess this must be serializable and put in the file. Sending the tree depends on its implementation, for what I have tried, an array based approach leads to more memory left unused for the tree since, the tree may not be a balanced tree most of the time. If the tree was balanced then array representation would have generated the best option.
Harisankar Krishna swamy
您尝试过自适应霍夫曼编码吗? 乍一看,似乎根本不需要发送树,但需要做更多工作来优化和同步树。
Did you try adaptive Huffman coding? From first look it seems the tree need not be sent at all, but more work to optimize and synchronize the tress.