Condition on a character's frequency for a one-bit code in Huffman coding?

Posted on 2024-09-06 13:40:46


This is a question I ran into in school settings, but it keeps bugging me so I decided to ask it here.

In Huffman compression, fixed-length sequences (characters) are encoded with variable-length sequences. The code sequence length depends on the frequencies (or probabilities) of the source characters.

My question is: what is the minimum value of the highest character frequency for which that character is guaranteed to be encoded with a single bit?


2 Comments

心清如水 2024-09-13 13:40:46


It turns out that the answer is 0.4: if the highest frequency p satisfies p >= 0.4, a 1-bit code for the corresponding character is guaranteed. In other words, this is a sufficient condition.

It is also true that p >= 1/3 is a necessary condition. That is, there may be examples where 0.4 > p >= 1/3 and the shortest code is 1 bit, but there are no such cases when p < 1/3.

The way to reason about this is to look at how the code tree is constructed, in particular at the frequencies of the last 3 surviving subtrees. A proof appears in Johnsen, "On the redundancy of binary Huffman codes", 1980 (unfortunately behind a paywall).
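These thresholds can also be checked empirically by running the Huffman merge procedure directly. Below is a minimal sketch (the helper `huffman_code_lengths` and the example frequencies are my own, not from the answer):

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Build a Huffman tree and return each symbol's code length.

    freqs: dict mapping symbol -> probability (or any positive weight).
    """
    tiebreak = count()  # unique tiebreaker so equal-weight entries never compare lists
    heap = [(w, next(tiebreak), [s]) for s, w in freqs.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freqs}
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        # every symbol in a merged subtree ends up one level deeper in the tree
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), syms1 + syms2))
    return lengths


# Above the 0.4 threshold, the dominant symbol gets a 1-bit code:
print(huffman_code_lengths({'a': 0.45, 'b': 0.25, 'c': 0.15, 'd': 0.15}))
# -> {'a': 1, 'b': 2, 'c': 3, 'd': 3}

# Between 1/3 and 0.4, it may not: here 'a' ends up with 2 bits.
print(huffman_code_lengths({'a': 0.35, 'b': 0.25, 'c': 0.2, 'd': 0.2}))
# -> {'a': 2, 'b': 2, 'c': 2, 'd': 2}
```

Note that at exactly p = 0.4 the merge order can involve ties, so which optimal tree you get depends on tie-breaking; choosing p strictly above 0.4 avoids that ambiguity in the demonstration.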

妞丶爷亲个 2024-09-13 13:40:46


In general, about 50% of the incoming symbol stream would have to consist of a given symbol for Huffman to code it as a single bit. The reason for this is that due to how Huffman coding works (the one symbol's encoding cannot be the prefix of another), by encoding a symbol with a single bit, you are requiring that the first bit for every other symbol be the opposite value (i.e. if one symbol is encoded as 0, everything else must start with 1 plus at least one more bit). Since you're eliminating half of the possible encoding space for any given bit length, you need to be gaining a way to encode at least half of the symbols being input in order to break even.

Note that there is a special case where the symbol space consists of only 3 symbols. In that case, whichever symbol has the greatest frequency will be encoded with 1 bit (since the other two will be the 2nd-bit variations of whichever first-bit value isn't chosen); if two or more are tied for the greatest probability, either one could get the 1-bit code. Thus, in the 3-symbol case a symbol with, say, 34% probability could theoretically be coded as a single bit (say, 0) while the other two have 33% or lower probabilities and are coded as 10 and 11.

So if you're considering all possibilities, then technically anything 1/3 or above could potentially be encoded as a single bit (in the 3-symbols case).
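The 3-symbol case above can be illustrated by generating the actual code table. This is a sketch under my own assumptions (the helper `huffman_codes` and the 0.34/0.33/0.33 split are illustrative, not from the answer):

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return a Huffman code table mapping each symbol to its bit string."""
    tiebreak = count()  # unique tiebreaker so equal-weight entries never compare dicts
    heap = [(w, next(tiebreak), {s: ''}) for s, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        # prepend 0 to one subtree's codes and 1 to the other's when merging
        merged = {s: '0' + c for s, c in codes1.items()}
        merged.update({s: '1' + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


# With only 3 symbols, even p = 0.34 gets the single-bit code:
print(huffman_codes({'a': 0.34, 'b': 0.33, 'c': 0.33}))
# -> {'a': '0', 'b': '10', 'c': '11'}
```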
