贪婪地编码的霍夫曼树的时间复杂性
编辑以澄清这个问题
我将在Uni上使用两种众所周知的压缩算法(Huffman Coding和Lempel-Ziv 77)进行实验室项目。我对霍夫曼编码的实施类似于贪婪的方法,在该方法中,树是通过以下步骤构建的:
1. Calculate frequencies for all unique characters and place them in a minimum heap
2. While there are more than two nodes in heap:
2.1 Take a value from the minimum heap and place it as the left child
2.2 Take a value from the minimum heap and place it as the right child
2.3 Create a parent node with a frequency of 'frequency of the left child + frequency
of the right child'
2.4 Place the parent node to the minimum heap and continue
3. Finalize the tree with a root node (children are the last two nodes in the minimum heap)
我能够在两种算法的所有其他步骤的时间复杂性上找到良好的来源,除了 霍夫曼编码的减压阶段。我目前的理解是,即使霍夫曼树的实现是不平衡的,在最坏的情况下,长度为n-1(n等于唯一字符的计数),遍历时代也与不同的频率平衡人物。
在不平衡的霍夫曼树中,具有较高频率的叶子节点更为常见,并且具有较短的遍历路径。这可以平衡总时间成本,而我的直觉指出,总时间将接近O(k log n)的时间复杂性,其中k是未压缩内容的长度,而n huffman树中的独特字符。
如果我有可靠的来源可以参考减压阶段的时间复杂性支持我的直觉或反驳它,我会感到更加舒适。如果有人在涵盖这个特定问题的书籍或不罚款的书籍上有很好的提示,我将非常感谢您的帮助!我在项目中付出了很多努力,即使我没有及时找到答案,我也不会认为这个细节会造成很大的凹痕。我主要是对这个主题超级兴奋,并希望尽可能多地学习。
以防万一有人在一个类似的问题上偶然发现这个问题,这是我的项目。请记住,这是一个学生项目,在审查它的同时保持批判性思维是一件好事。
Edited to clarify the question
I'm about to turn in a laboratory project at Uni on two well known compression algorithms (Huffman coding and Lempel-Ziv 77). My implementation of Huffman coding is similar to the greedy approach, in which the tree is built with the following steps:
1. Calculate frequencies for all unique characters and place them in a minimum heap
2. While there are more than two nodes in heap:
2.1 Take a value from the minimum heap and place it as the left child
2.2 Take a value from the minimum heap and place it as the right child
2.3 Create a parent node with a frequency of 'frequency of the left child + frequency
of the right child'
2.4 Place the parent node to the minimum heap and continue
3. Finalize the tree with a root node (children are the last two nodes in the minimum heap)
I've been able to find good sources on the time complexities of all other steps for both algorithms, except for the decompression phase of Huffman coding. My current understanding is that even though this implementation of Huffman tree is unbalanced and in the worst case has the longest path of length n-1 (in which n equals the count of unique characters), the traversal times are balanced by the frequencies of different characters.
In an unbalanced Huffman tree the leaf nodes with higher frequencies are more common and have a shorter traversal paths. This balances the total time cost and my intuition states that the total time would approach a time complexity of O(k log n), in which k is the length of the uncompressed content and n the unique characters in the Huffman tree.
I'd feel much more comfortable if I had reliable source to reference regarding the time complexity of the decompression phase that either support my intuition or counters it. If anyone has good tips on books or not-super-difficult-to-read-articles that cover this particular question, I'd very much appreciate the help! I've put quite a lot of effort into my project and I don't think this one detail is going to make a big dent, even if I don't find the answer in time. I'm mainly just super-fascinated by the topic and want to learn as much as possible.
And just in case anyone ever stumbles into this question on a similar question, here's my project. Keep in mind that it's a student project and it's good to keep a critical mind while reviewing it.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
对于 n 的符号,当频率被排序时,生成huffman代码为o( n )。总体上分类为O( n log n ),因此这就是霍夫曼算法的时间复杂性。
示例中的优先队列是进行分类的一种方法。在进行Huffman编码之前,还有其他方法可以进行分类。
通常,LZ77和Huffman编码合并。
Generating a Huffman code is O(n) for n symbols being coded, when the frequencies are sorted. The sorting in general takes O(n log n), so that is the time complexity for the Huffman algorithm.
The priority queue in your example is one way to do the sorting. There are other approaches where the sorting is done before doing Huffman coding.
Generally, LZ77 and Huffman coding are combined.