长度受限哈夫曼码的包合并算法
下面的解释来自维基百科关于使用包合并的长度限制霍夫曼代码。我不明白,我对此有一些疑问。
- 我们如何包装?
- 我们如何合并?
- 我们如何识别符号位串的长度?
设L为任何码字允许的最大长度。令 p1, …, pn 为符号的频率要编码的字母表。我们首先对符号进行排序,使得 pi ≤ pi +1。为每个符号创建 L 个硬币,面额为 2−1, …, 2−L,每种都是钱币值pi。使用包合并算法选择具有最小钱币价值的硬币集合,其面额总计为 n − 1。令 hi 为数字已选择钱币价值 pi 的硬币。最佳长度受限的霍夫曼编码将使用长度为 hi 的位串对符号 i 进行编码。"< /p>
The explanation below is from Wikipedia about length-limited Huffman codes using package-merge. I can not understand, I have some questions about this.
- how we package?
- how we merge?
- how we recognize the length of bit string for symbols?
Let L be the maximum length any code word is permitted to have. Let p1, …, pn be the frequencies of the symbols of the alphabet to be encoded. We first sort the symbols so that pi ≤ pi+1. Create L coins for each symbol, of denominations 2−1, …, 2−L, each of numismatic value pi. Use the package-merge algorithm to select the set of coins of minimum numismatic value whose denominations total n − 1. Let hi be the number of coins of numismatic value pi selected. The optimal length-limited Huffman code will encode symbol i with a bit string of length hi."
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
也许这只是构建霍夫曼代码的另一种方法。您是否看过http://cbloomrants .blogspot.com/2010/07/07-02-10-length-limited-huffman-codes.html? IMO 包合并算法并没有构建霍夫曼树。您想要寻找哥伦布密码。
Maybe it is just an alternative way to build a huffman code. Have you looked at http://cbloomrants.blogspot.com/2010/07/07-02-10-length-limitted-huffman-codes.html? IMO the package-merge algorithm isn't building a huffman tree. You want to look for the Golomb Code.
是的,这只是构建对码字长度有限制的霍夫曼代码的一种方法。
霍夫曼码将字母表中的每个字母编码为可以唯一确定的二进制字符串。例如,如果您的字母表是 {A, B, C},并且 A 比 B 和 C 更常见,则以下编码可以很好地工作:
诸如 0010110 的编码字符串可以被唯一编码,因为编码位串的长度已经由哈夫曼代码定义了(这里——每个以 0 开头的字符串的长度为 1,每个以 1 开头的字符串的长度为 2)。因此字符串解码为 0|0|10|11|0 = AABCA。
现在构造霍夫曼码的“问题”是如何选择编码位串,使得结果编码平均尽可能短。在您的问题中,有一个额外的限制,即任何代码字的长度不能超过L。总体思路是使用较短的字符串来编码更常见的符号。
包合并算法的细节并不重要,因为关键是您使用算法来选择“面额总计 n - 1 的具有最小钱币价值的硬币组”。如果您有面额为 2−1, 2−2, ... 的硬币,并且您想用它们构建总价值 100 美分,您可以考虑这个过程首先有一枚价值 100 的硬币,然后将其分成两个 50 美分 (2−1),然后根据需要继续将硬币分成两半,例如50 美分 + 25 美分 + 12.5 美分 + 12.5 美分。这对应于二叉树的构造;每当你分裂一枚硬币时,你都会在二叉树中创建一个内部节点,并在更深的一层上添加两个叶子。
现在最小化钱币价值的想法是,那些链接到“更高频率”符号的“硬币”使用起来更昂贵,因此您希望更少地分割这些硬币,对应于具有更短的代码。
细节留给读者作为练习。
Yes, it's just a way to build a Huffman code that has limit on codeword length.
A Huffman code encodes every letter of the alphabet with a binary string that can be uniquely determined. For example, if your alphabet is {A, B, C} and A is more common than B and C, the following encoding can work well:
An encoded string such as 0010110 can be uniquely encoded, because the length of the encoding bit string is defined already by the Huffman code (here --- every string that begins with 0 has length 1, and every string that beings with 1 has length 2). So the string decodes to 0|0|10|11|0 = AABCA.
Now the "problem" in constructing Huffman codes is how to select the encoding bit strings so that the resulting encodings are on the average as short as possible. In your problem there is an additional constraint that the length of any code word cannot exceed L. The general idea is to use shorter strings to encode more common symbols.
The details of the package-merge algorithm aren't important, as the key is that you use an algorithm to select "set of coins of minimum numismatic value whose denominations total n - 1". If you have coins with denominations 2−1, 2−2, ..., and you want to build total value of 100 cents out of them, you can think of this process as having first a coin of value 100, and then splitting it to two 50 cents (2−1), and then continuing to split your coins in half as long as you want, e.g. 50 cents + 25 cents + 12.5 cents + 12.5 cents. This corresponds to the construction of a binary tree; whenever you split a coin, you create an internal node in a binary tree and add two leaves on one level deeper.
Now the idea of minimizing the numismatic value is that those "coins" that are linked to "higher frequency" symbols are more expensive to use, so you want to split those coins less, corresponding to having shorter codes.
The details are left as an exercise to the reader.