Efficient way of storing a Huffman tree
I am writing a Huffman encoding/decoding tool and am looking for an efficient way to store, inside the output file, the Huffman tree that is created.
Currently there are two different versions I am implementing.
- This one reads the entire file into memory character by character and builds a frequency table for the whole document. This would only require outputting the tree once, and thus efficiency is not that big of a concern, other than if the input file is small.
- The other method I am using is to read a chunk of data, about 64 kilobytes in size, run the frequency analysis over it, create a tree and encode with it. However, in this case, before every chunk I will need to output my frequency tree so that the decoder is able to rebuild its tree and properly decode the encoded file. This is where efficiency really comes into play, since I want to save as much space as possible.
In my searches so far I have not found a good way of storing the tree in as little space as possible; I am hoping the StackOverflow community can help me find a good solution!
Comments (6)
Since you already have to implement code to handle a bit-wise layer on top of your byte-organized stream/file, here's my proposal.
Do not store the actual frequencies; they're not needed for decoding. You do, however, need the actual tree.
So for each node, starting at root:
- If this node is a leaf node, output a 1-bit, then the N-bit character/byte it holds.
- If it is not a leaf node, output a 0-bit, then encode both child nodes (left first, then right) the same way.

To read, do this:
- Read one bit. If it is 1, read the next N bits as a character/byte and return a new leaf node holding that value, with no children.
- If it is 0, decode the left and the right child node the same way, and return a new node with those two children but no value.

A leaf node is basically any node that doesn't have children.
With this approach, you can calculate the exact size of your output before writing it, to figure out if the gains are enough to justify the effort. This assumes you have a dictionary of key/value pairs that contains the frequency of each character, where frequency is the actual number of occurrences.
Pseudo-code for calculation:
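A minimal Python-flavoured sketch of that calculation, assuming frequencies is the dictionary of occurrence counts described above and PATH is a table mapping each character to its bit path:

```python
SIZE_OF_ONE_CHARACTER = 8   # bits per character

def output_size_in_bits(frequencies, PATH):
    """frequencies: dict mapping character -> number of occurrences.
    PATH: dict mapping character -> bit string from the root to that character."""
    number_of_characters = len(frequencies)
    # Each leaf costs a 1-bit flag plus the character itself; each of the
    # (number_of_characters - 1) non-leaf nodes costs a single 0-bit.
    tree_size = number_of_characters * (SIZE_OF_ONE_CHARACTER + 1) + (number_of_characters - 1)
    # Every occurrence of a character costs one copy of its path.
    encoded_size = sum(frequencies[c] * len(PATH[c]) for c in frequencies)
    return tree_size + encoded_size
```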
The tree-size calculation takes the leaf and non-leaf nodes into account, and there's one less internal (non-leaf) node than there are characters.
SIZE_OF_ONE_CHARACTER would be the number of bits per character, and those two values give you the total number of bits that my approach to the tree plus the encoded data will occupy.
PATH(c) is a function/table that would yield the bit-path from root down to that character in the tree.
Here's pseudo-code to do it, assuming one character is just a simple byte; it needs three pieces: writing the tree out, reading it back in, and a simplified example Node implementation.
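A small Python sketch of those three pieces; a plain list of 0/1 integers stands in for whatever bit-level I/O layer you already have, and the names are illustrative:

```python
class Node:
    """Huffman tree node: a leaf carries a byte value, an internal node carries two children."""
    def __init__(self, value=None, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None and self.right is None


def write_tree(node, bits):
    """Append the tree to bits: 1 followed by the 8-bit value for a leaf,
    0 followed by both children (left first) for a non-leaf node."""
    if node.is_leaf():
        bits.append(1)
        bits.extend((node.value >> i) & 1 for i in range(7, -1, -1))
    else:
        bits.append(0)
        write_tree(node.left, bits)
        write_tree(node.right, bits)


def read_tree(bits, pos=0):
    """Rebuild a tree from bits starting at index pos; returns (node, next_pos)."""
    if bits[pos] == 1:
        value = 0
        for b in bits[pos + 1:pos + 9]:
            value = (value << 1) | b
        return Node(value=value), pos + 9
    left, pos = read_tree(bits, pos + 1)
    right, pos = read_tree(bits, pos)
    return Node(left=left, right=right), pos
```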
Here's a sample output from a specific example.
Input: AAAAAABCCCCCCDDEEEEE
Frequencies: A: 6, B: 1, C: 6, D: 2, E: 5
Each character is just 8 bits and there are 5 distinct characters, so the size of the tree will be 10 * 5 - 1 = 49 bits.
One tree consistent with those frequencies, written as nested pairs, is ((A, C), (E, (B, D))): A, C and E sit at depth 2, B and D at depth 3. So the paths to each character are as follows (0 is left, 1 is right): A = 00, B = 110, C = 01, D = 111, E = 10.
So to calculate the output size: A: 6 occurrences × 2 bits = 12; B: 1 × 3 = 3; C: 6 × 2 = 12; D: 2 × 3 = 6; E: 5 × 2 = 10.
Sum of encoded bytes is 12 + 3 + 12 + 6 + 10 = 43 bits.
Add that to the 49 bits from the tree, and the output will be 92 bits, or 12 bytes. Compare that to the 20 * 8 = 160 bits (20 bytes) necessary to store the original 20 characters unencoded, and you'll save 8 bytes.
The final output, including the tree to begin with, is as follows. Each character in the stream (A-E) is really encoded as 8 bits, whereas 0 and 1 are just single bits. The space in the stream only separates the tree from the encoded data and does not take up any space in the final output. With the tree assumed above it would be: 001A1C01E01B1D 000000000000110010101010101111111010101010
For the concrete example you have in the comments, AABCDEF, you will get this:
Input: AABCDEF
Frequencies: A: 2, B: 1, C: 1, D: 1, E: 1, F: 1
Tree (as nested pairs): ((A, B), ((C, D), (E, F)))
Paths: A = 00, B = 01, C = 100, D = 101, E = 110, F = 111
Tree: 001A1B001C1D01E1F = 59 bits
Data: 000001100101110111 = 18 bits
Sum: 59 + 18 = 77 bits = 10 bytes
Since the original was 7 characters of 8 bits = 56 bits (7 bytes), such small pieces of data carry too much overhead.
If you have enough control over the tree generation, you could make it do a canonical tree (the same way DEFLATE does, for example), which basically means you create rules to resolve any ambiguous situations when building the tree. Then, like DEFLATE, all you actually have to store are the lengths of the codes for each character.
That is, if you had the tree/codes Lasse mentioned above (for example A = 00, B = 110, C = 01, D = 111, E = 10), then you could store those as just their code lengths:
2, 3, 2, 3, 2
And that's actually enough information to regenerate the huffman table, assuming you're always using the same character set -- say, ASCII. (Which means you couldn't skip letters -- you'd have to list a code length for each one, even if it's zero.)
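As a sketch of that regeneration in Python, using the canonical rule DEFLATE itself uses (shorter codes first, ties broken by symbol order); the function name and the use of strings for the codes are just illustrative choices:

```python
def canonical_codes(code_lengths):
    """code_lengths: dict mapping symbol -> code length in bits (0 = symbol unused).
    Returns dict mapping symbol -> code bit-string, assigned canonically:
    shorter codes first, ties broken by symbol order."""
    codes = {}
    next_code = 0
    prev_length = 0
    for symbol, length in sorted(code_lengths.items(), key=lambda kv: (kv[1], kv[0])):
        if length == 0:
            continue                           # symbol does not occur
        next_code <<= (length - prev_length)   # step down to the new code length
        codes[symbol] = format(next_code, '0{}b'.format(length))
        next_code += 1
        prev_length = length
    return codes

# canonical_codes({'A': 2, 'B': 3, 'C': 2, 'D': 3, 'E': 2})
# -> {'A': '00', 'C': '01', 'E': '10', 'B': '110', 'D': '111'}
```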
If you also put a limitation on the bit lengths (say, 7 bits), you could store each of these numbers using short binary strings. So 2,3,2,3,2 becomes 010 011 010 011 010 -- which fits in 2 bytes.
If you want to get really crazy, you could do what DEFLATE does, and make another huffman table of the lengths of these codes, and store its code lengths beforehand. Especially since they add extra codes for "insert zero N times in a row" to shorten things further.
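To illustrate just the zero-run part in Python (a much simplified stand-in: DEFLATE actually uses dedicated code-length symbols 16, 17 and 18 with extra bits, and can also repeat the previous non-zero length):

```python
def collapse_zero_runs(code_lengths, min_run=3, max_run=138):
    """Replace runs of at least min_run zero lengths with a single ('Z', count) token."""
    out, i = [], 0
    while i < len(code_lengths):
        if code_lengths[i] == 0:
            run = 1
            while i + run < len(code_lengths) and code_lengths[i + run] == 0 and run < max_run:
                run += 1
            if run >= min_run:
                out.append(('Z', run))    # one token instead of `run` zeros
            else:
                out.extend([0] * run)
            i += run
        else:
            out.append(code_lengths[i])
            i += 1
    return out
```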
The RFC for DEFLATE isn't too bad, if you're already familiar with huffman coding: http://www.ietf.org/rfc/rfc1951.txt
Branches are 0, leaves are 1. Traverse the tree depth-first to get its "shape", then follow that with the bits for the characters in the same depth-first order (for example AECBD); when reading, you'll know how many characters to expect from the shape of the tree. Then output the codes for the message. You end up with one long series of bits that you can divide up into characters (bytes) for output.
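A small Python sketch of that serialization, assuming a tree represented as nested (left, right) tuples with characters at the leaves:

```python
def shape_and_leaves(tree):
    """Depth-first walk: emit 0 for a branch, 1 for a leaf, and collect the
    leaf characters in the same order so they can be written after the shape."""
    shape, leaves = [], []

    def walk(node):
        if isinstance(node, tuple):        # branch
            shape.append(0)
            walk(node[0])
            walk(node[1])
        else:                              # leaf
            shape.append(1)
            leaves.append(node)

    walk(tree)
    return shape, leaves

# With the AABCDEF tree from the first answer:
# shape_and_leaves((('A', 'B'), (('C', 'D'), ('E', 'F'))))
# -> ([0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1], ['A', 'B', 'C', 'D', 'E', 'F'])
```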
If you are chunking it, you could test whether storing the tree for the next chunk is really more efficient than just reusing the tree from the previous chunk, and make a tree shape of "1" the indicator that the previous chunk's tree should simply be reused.
The tree is generally created from a frequency table of the bytes. So store that table, or just the bytes themselves sorted by frequency, and re-create the tree on the fly. This of course assumes that you're building the tree to represent single bytes, not larger blocks.
UPDATE: As pointed out by j_random_hacker in a comment, you actually can't do this: you need the frequency values themselves. They are combined and "bubble" upwards as you build the tree. (For example, the bytes sort the same way for the frequency sets {1, 1, 1, 1} and {1, 1, 1, 3}, yet the first gives every symbol a 2-bit code while the second gives code lengths 3, 3, 2 and 1.) This page describes the way a tree is built from the frequency table. As a bonus, it also saves this answer from being deleted by mentioning a way to save out the tree.
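To see why the ordering alone is not enough, one can compute code lengths for two frequency tables whose bytes sort identically; a rough heap-based Python sketch (one standard way to derive the lengths):

```python
import heapq

def huffman_code_lengths(freqs):
    """Return {symbol: code length} for a {symbol: frequency} dict."""
    # Each heap entry is (total frequency, tiebreaker, {symbol: depth so far}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Same byte ordering, different trees:
# huffman_code_lengths({'w': 1, 'x': 1, 'y': 1, 'z': 1})  -> every length is 2
# huffman_code_lengths({'w': 1, 'x': 1, 'y': 1, 'z': 3})  -> lengths 3, 3, 2, 1
```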
A better approach
Tree: take the AABCDEF example above, with codes A = 00, B = 01, C = 100, D = 101, E = 110, F = 111.
Now just derive this table: for each symbol, keep only its tree depth (its code length), i.e. A → 2, B → 2, C → 3, D → 3, E → 3, F → 3.
You don't need to use the same binary tree, just keep the computed tree depth, i.e. the number of encoding bits. So just keep the vector of uncompressed values [A B C D E F] ordered by tree depth, and use relative indexes into this separate vector instead. Now recreate the aligned bit patterns for each depth: depth 2 gives 00x and 01x (000 and 010 when left-aligned to the maximum depth of 3), and depth 3 gives 100, 101, 110 and 111.
What you immediately see is that only the first bit pattern in each row is significant. You get the following lookup table: bit depth 2 → first pattern 000, first index 0 (A); bit depth 3 → first pattern 100, first index 2 (C).
This LUT has a very small size (even if your Huffman codes can be 32 bits long, it will only contain 32 rows), and since the first pattern is always zero, you can ignore it completely when performing a binary search of patterns in it (here only 1 pattern needs to be compared to know whether the bit depth is 2 or 3 and to get the first index at which the associated data is stored in the vector). In our example you'll need to perform a fast binary search of input patterns in a search space of at most 31 values, i.e. a maximum of 5 integer compares. That binary search can be unrolled into fixed compare paths to avoid all loops and any state management while walking the integer binary lookup tree.
All of this fits in a small fixed-length table (the LUT needs at most 31 rows for Huffman codes no longer than 32 bits, and the two other columns above fill at most 32 rows).
In other words, the LUT above requires 31 ints of 32-bit size each, plus 32 bytes to store the bit-depth values: but you can avoid even that by implying the depth column (and the first row, for depth 1):
So your LUT contains [000, 100, 000 (30 times)]. To search in it you must find the position where the input bit pattern sits between two patterns: it must be lower than the pattern at the next position in the LUT but still higher than or equal to the pattern at the current position (if both positions contain the same pattern, the current row will not match and the input pattern fits below). You then divide and conquer, using at most 5 tests (the binary search is a single piece of code with 5 nested if/then/else levels; it has 32 branches, and the branch reached directly indicates the bit depth, which therefore does not need to be stored; you then perform a single directly indexed lookup into the second table to get the first index, and derive the final index into the vector of decoded values additively).
Once you have a position in the lookup table (by searching in the first column), you immediately know how many bits to take from the input and the start index into the vector. The bit depth you get can be used to derive the adjusted index position directly, by basic bit masking after subtracting the row's first pattern.
In summary: never store linked binary trees, and you don't need any loop to perform the lookup, which just requires 5 nested ifs comparing patterns at fixed positions in a table of 31 patterns, plus a table of 31 ints containing the start offset within the vector of decoded values (in the first branch of the nested if/then/else tests, the start offset into the vector is implied: it is always zero; it is also the most frequently taken branch, since it matches the shortest codes, which belong to the most frequent decoded values).
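A rough Python sketch of the table construction and the decode step described above, using the A-F codes from the earlier example; a plain linear scan stands in for the unrolled binary search, and all names are illustrative:

```python
def build_decode_table(symbols_by_depth):
    """symbols_by_depth: dict mapping bit depth -> symbols at that depth,
    e.g. {2: ['A', 'B'], 3: ['C', 'D', 'E', 'F']}.
    Returns (max_depth, rows, values) where each row is
    (first pattern left-aligned to max_depth, depth, first index into values)."""
    max_depth = max(symbols_by_depth)
    rows, values = [], []
    code, prev_depth = 0, 0
    for depth in sorted(symbols_by_depth):
        code <<= (depth - prev_depth)                 # canonical first code at this depth
        rows.append((code << (max_depth - depth), depth, len(values)))
        values.extend(symbols_by_depth[depth])
        code += len(symbols_by_depth[depth])
        prev_depth = depth
    return max_depth, rows, values


def decode_one(bits, pos, max_depth, rows, values):
    """Decode one symbol from the bit list, starting at pos; returns (symbol, new_pos)."""
    window = 0
    for i in range(max_depth):                        # peek max_depth bits, zero-padded at the end
        bit = bits[pos + i] if pos + i < len(bits) else 0
        window = (window << 1) | bit
    # Find the row with the largest first pattern <= window (the answer does this with
    # a fixed, unrolled binary search; a linear scan keeps the sketch short).
    first_pattern, depth, first_index = max((r for r in rows if r[0] <= window),
                                            key=lambda r: r[0])
    offset = (window - first_pattern) >> (max_depth - depth)
    return values[first_index + offset], pos + depth

# max_depth, rows, values = build_decode_table({2: ['A', 'B'], 3: ['C', 'D', 'E', 'F']})
# rows is [(0b000, 2, 0), (0b100, 3, 2)] -- the [000, 100, ...] patterns above
```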
There are two main ways to store huffman code LUTs as the other answers state. You can either store the geometry of the tree, 0 for a node, 1 for a leaf, then put in all the leaf values, or you can use canonical huffman encoding, storing the lengths of the huffman codes.
The thing is, one method is better than the other depending on the circumstances.
Let's say the number of unique symbols in the data you wish to compress (in aabbbcdddd there are 4 unique symbols: a, b, c, d) is n. The number of bits to store the geometry of the tree alongside the symbols in the tree is 10n - 1. Assuming you store the code lengths in the order of the symbols they belong to, and that each code length is stored in 8 bits (a code length for a 256-symbol alphabet will never need more than 8 bits), the size of the code length table will be a flat 2048 bits.

When you have a high number of unique symbols, say 256, it will take 2559 bits to store the geometry of the tree. In this case, the code length table is much more efficient: 511 bits more efficient, to be exact.

But if you only have 5 unique symbols, the tree geometry only takes 49 bits, and in this case, compared to storing the code length table, storing the tree geometry is almost 2000 bits better.

The tree geometry is most efficient for n < 205, while a code length table is more efficient for n >= 205. So, why not get the best of both worlds, and use both? Have 1 bit at the start of your compressed data indicate whether the next however-many bits are in the format of a code length table or the geometry of the huffman tree. In fact, why not add two bits: when both of them are 0, there is no table and the data is uncompressed, because sometimes you can't get compression! It would be best to have a single 0x00 byte at the beginning of your file telling your decoder not to worry about doing anything; that saves space by not including the table or the geometry of a tree, and saves time by not having to unnecessarily compress and decompress the data.
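A tiny Python sketch of that format choice, using the 10n - 1 and 2048-bit figures above; the two-bit tag values are an arbitrary illustrative choice:

```python
ALPHABET_SIZE = 256      # symbols in the (fixed) alphabet
SYMBOL_BITS = 8          # bits per symbol

def header_sizes(num_unique_symbols):
    """Bits needed by each header format discussed above."""
    # 1 flag bit + symbol per leaf, plus (n - 1) internal flag bits = 10n - 1 for 8-bit symbols.
    geometry_bits = (SYMBOL_BITS + 1) * num_unique_symbols + (num_unique_symbols - 1)
    length_table_bits = SYMBOL_BITS * ALPHABET_SIZE   # flat 2048 bits
    return geometry_bits, length_table_bits

def pick_header_tag(num_unique_symbols, compression_helped=True):
    """0b00 = data stored uncompressed, 0b01 = tree geometry header,
    0b10 = code-length table header (tag values are illustrative)."""
    if not compression_helped:
        return 0b00
    geometry_bits, length_table_bits = header_sizes(num_unique_symbols)
    return 0b01 if geometry_bits < length_table_bits else 0b10

# pick_header_tag(5)   -> 0b01  (geometry: 49 bits beats 2048)
# pick_header_tag(256) -> 0b10  (table: 2048 bits beats 2559)
```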
. So, why not get the best of both worlds, and use both? Have 1 bit at the start of your compressed data represent whether the next however many bits are going to be in the format of a code length table, or the geometry of the huffman tree.In fact, why not add two bits, and when both of them are 0, there is no table, the data is uncompressed. Because sometimes, you can't get compression! And it would be best to have a single byte at the beginning of your file that is 0x00 telling your decoder not to worry about doing anything. Saves space by not including the table or geometry of a tree, and saves time, not having to unnecessarily compress and decompress data.