Condition for a single-bit code for a character in Huffman coding?
This is a question I ran into in a school setting, but it keeps bugging me, so I decided to ask it here.
In Huffman compression, fixed-length sequences (characters) are encoded with variable-length sequences. The code sequence length depends on the frequencies (or probabilities) of the source characters.
My question is: what is the minimum value of the highest character frequency at which that character will be encoded with a single bit?
It turns out that the answer is 0.4: if the highest frequency p satisfies p >= 0.4, a 1-bit code for the corresponding character is guaranteed. In other words, this is a sufficient condition.
It is also true that p >= 1/3 is a necessary condition. That is, there are examples where 0.4 > p >= 1/3 and the shortest code is 1 bit, but no such cases exist when p < 1/3.
The way to reason about this is to look at how the code tree is constructed, in particular at the frequencies of the last 3 surviving subtrees. A proof appears in Johnsen, "On the redundancy of binary Huffman codes", 1980 (unfortunately behind a paywall).
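A quick way to see the thresholds in action is to run the Huffman construction on a few distributions and check the code length the most frequent symbol receives. Below is a minimal sketch of that check, assuming a standard heap-based Huffman build; the helper name huffman_code_lengths is made up for this illustration, and since ties at exactly p = 0.4 can be broken either way by the construction, the examples stay safely above and below the thresholds.

    import heapq
    from itertools import count

    def huffman_code_lengths(freqs):
        """Return the Huffman code length assigned to each input frequency."""
        tiebreak = count()  # keeps heap entries comparable when frequencies tie
        # Each heap entry: (subtree frequency, tiebreaker, symbol indices inside it)
        heap = [(f, next(tiebreak), [i]) for i, f in enumerate(freqs)]
        heapq.heapify(heap)
        lengths = [0] * len(freqs)
        while len(heap) > 1:
            fa, _, a = heapq.heappop(heap)
            fb, _, b = heapq.heappop(heap)
            # Merging two subtrees pushes every symbol in them one level deeper,
            # i.e. adds one bit to each of their codewords.
            for i in a + b:
                lengths[i] += 1
            heapq.heappush(heap, (fa + fb, next(tiebreak), a + b))
        return lengths

    # Highest frequency above 0.4: the first symbol is guaranteed a 1-bit code.
    print(huffman_code_lengths([0.45, 0.2, 0.15, 0.1, 0.1]))  # -> [1, 3, 3, 3, 3]
    # Highest frequency below 1/3: the first symbol never gets a 1-bit code.
    print(huffman_code_lengths([0.3, 0.25, 0.25, 0.2]))       # -> [2, 2, 2, 2]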
In general, about 50% of the incoming symbol stream would have to consist of a given symbol for Huffman to code it as a single bit. The reason is that, due to how Huffman coding works (one symbol's encoding cannot be a prefix of another's), encoding a symbol with a single bit requires the first bit of every other symbol to be the opposite value (i.e. if one symbol is encoded as 0, everything else must start with 1 plus at least one more bit). Since you are eliminating half of the possible encoding space for any given bit length, you need to gain a way to encode at least half of the input symbols just to break even.

Note that there is a special case where the symbol space consists of only 3 symbols. In that case, whichever symbol has the greatest frequency will be encoded with 1 bit (since the other two will be the second-bit variations of whichever first-bit value isn't chosen); if 2 or more are tied for the greatest probability, either one could be the one so encoded. Thus, in the 3-symbol case a symbol with, say, 34% probability could theoretically be coded as a single bit (say, 0) while the other two, with 33% or lower probabilities, are coded as 10 and 11.

So if you consider all possibilities, then technically anything at 1/3 or above could potentially be encoded as a single bit (in the 3-symbol case).
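To make the 3-symbol case concrete, here is a small sketch (again my own illustration, with a hypothetical huffman_codes helper) that builds the actual codewords. With frequencies 34% / 33% / 33%, the two 33% symbols are merged first, leaving the single bit for the 34% symbol:

    import heapq
    from itertools import count

    def huffman_codes(freqs):
        """Return a Huffman codeword (bit string) for each input frequency."""
        tiebreak = count()  # keeps heap entries comparable when frequencies tie
        codes = [""] * len(freqs)
        heap = [(f, next(tiebreak), [i]) for i, f in enumerate(freqs)]
        heapq.heapify(heap)
        while len(heap) > 1:
            fa, _, a = heapq.heappop(heap)
            fb, _, b = heapq.heappop(heap)
            # Each merge prepends one bit to every codeword in the two subtrees:
            # 0 for the first subtree popped, 1 for the second.
            for i in a:
                codes[i] = "0" + codes[i]
            for i in b:
                codes[i] = "1" + codes[i]
            heapq.heappush(heap, (fa + fb, next(tiebreak), a + b))
        return codes

    # The 34% symbol gets the 1-bit code; the 33% symbols share the other branch.
    print(huffman_codes([0.34, 0.33, 0.33]))  # -> ['0', '10', '11']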