In a nutshell:
Is the Shannon-Fano coding as described in Fano's paper The Transmission of Information (1949) really ambiguous?
In Detail:
3 papers
Claude E. Shannon published his famous paper A Mathematical Theory of Communication in July 1948. In this paper he introduced the term bit as we know it today, and he defined what we now call Shannon entropy. He also proposed an entropy-based data compression algorithm in this paper. But Shannon's algorithm was so weak that under certain circumstances the "compressed" messages could even be longer than in fixed-length coding. A few months later (March 1949) Robert M. Fano published an improved version of Shannon's algorithm in the paper The Transmission of Information. Three years after Fano (in September 1952), his student David A. Huffman published an even better version in his paper A Method for the Construction of Minimum-Redundancy Codes. Huffman coding is more efficient than its two predecessors and it is still used today. But my question is about the algorithm published by Fano, which is usually called Shannon-Fano coding.
The algorithm
This description is based on the one from Wikipedia. Sorry, I did not fully read Fano's paper; I only browsed through it. It is 37 pages long, and I really tried hard to find a passage where he talks about the topic of my question, but I could not find it. So, here is how Shannon-Fano encoding works:
- Count how often each character appears in the message.
- Sort all characters by frequency, with the highest-frequency characters at the top of the list.
- Divide the list into two parts, such that the sums of the frequencies in both parts are as equal as possible. Add the bit 0 to one part and the bit 1 to the other part.
- Repeat step 3 on each part that contains 2 or more characters, until all parts consist of only 1 character.
- Concatenate all bits from all rounds. This is the Shannon-Fano code of that character.
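These steps translate directly into code. Below is a minimal Python sketch of the Wikipedia description (my own illustration, not taken from Fano's paper); note the comment in the splitting step, which is exactly where the ambiguity discussed in the example below creeps in:

    from collections import Counter

    def shannon_fano(pairs):
        # pairs: list of (char, frequency), sorted by descending frequency.
        # Returns a dict that maps each character to its bit string.
        if len(pairs) == 1:
            return {pairs[0][0]: ""}
        total = sum(freq for _, freq in pairs)
        # Step 3: find the split where the frequency sums of the two parts
        # are as equal as possible. When two splits are equally good, this
        # sketch keeps the first one found (the smaller top part). That is
        # an arbitrary choice, and exactly the ambiguity shown below.
        best_split, best_diff, prefix = 1, float("inf"), 0
        for i in range(1, len(pairs)):
            prefix += pairs[i - 1][1]
            diff = abs(prefix - (total - prefix))
            if diff < best_diff:
                best_split, best_diff = i, diff
        # Step 4: recurse into both parts; step 5: prepend this round's bit.
        codes = {c: "0" + b for c, b in shannon_fano(pairs[:best_split]).items()}
        codes.update({c: "1" + b for c, b in shannon_fano(pairs[best_split:]).items()})
        return codes

    message = "aaabcde"
    pairs = sorted(Counter(message).items(), key=lambda kv: -kv[1])
    codes = shannon_fano(pairs)
    print(codes)                                # {'a': '0', 'b': '100', ...}
    print(sum(len(codes[c]) for c in message))  # 15 (with this tie-break)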
An example
Let's execute this on a really tiny example (I think it's the smallest message where the problem appears). Here is the message to encode:
aaabcde
Steps 1 and 2 produce the first 2 columns of both tables shown below. But if Wikipedia's explanation of Fano's algorithm is correct, then step 3 is ambiguous: applied to my example, there are two possibilities for splitting the list into 2 parts (see below). That the two possibilities produce different codes would not by itself be worth mentioning. But here is the point: the two possibilities produce codes of different lengths.
possibility 1
If there are 2 ways to split the list such that both parts are as balanced as possible, then put the character that stands at the splitting point (character b in my example) into the part containing the low-frequency characters:
+------+-------+-----+-----+-----+-----+-----+-----+------+
|      |       |  round1   |  round2   |  round3   |      |
| char | frequ | sum | bit | sum | bit | sum | bit | code |
+------+-------+-----+-----+-----+-----+-----+-----+------+
| a    |   3   |  3  |  0  |                       |   0  |
|      |       +-----+-----+-----+-----+-----+-----+------+
| b    |   1   |     |     |     |     |  1  |  0  |  100 |
|      |       |     |     |  2  |  0  +-----+-----+------+
| c    |   1   |     |     |     |     |  1  |  1  |  101 |
|      |       |  4  |  1  +-----+-----+-----+-----+------+
| d    |   1   |     |     |     |     |  1  |  0  |  110 |
|      |       |     |     |  2  |  1  +-----+-----+------+
| e    |   1   |     |     |     |     |  1  |  1  |  111 |
+------+-------+-----+-----+-----+-----+-----+-----+------+
The encoded message is
000100101110111      length = 15 bits
aaab  c  d  e
possibility 2
If there are 2 ways to split the list such that both parts are as balanced as possible, then put the character that stands at the splitting point into the part containing the high-frequency characters:
+------+-------+-----+-----+-----+-----+-----+-----+------+
|      |       |  round1   |  round2   |  round3   |      |
| char | frequ | sum | bit | sum | bit | sum | bit | code |
+------+-------+-----+-----+-----+-----+-----+-----+------+
| a    |   3   |     |     |  3  |  0  |           |  00  |
|      |       |  4  |  0  +-----+-----+           +------+
| b    |   1   |     |     |  1  |  1  |           |  01  |
|      |       +-----+-----+-----+-----+-----+-----+------+
| c    |   1   |     |     |     |     |  1  |  0  |  100 |
|      |       |     |     |  2  |  0  +-----+-----+------+
| d    |   1   |  3  |  1  |     |     |  1  |  1  |  101 |
|      |       |     |     +-----+-----+-----+-----+------+
| e    |   1   |     |     |  1  |  1  |           |  11  |
+------+-------+-----+-----+-----+-----+-----+-----+------+
The encoded message is
0000000110010111      length = 16 bits
a a a b c  d  e
So, it is one bit longer.
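With the sketch from above, the whole difference between the two possibilities comes down to a single comparison in the splitting loop (again my own illustration): preferring the later of two equally balanced splits puts b into the high-frequency part and reproduces possibility 2 with its 16-bit result:

    # Same search loop as in the sketch above, but with <= instead of <:
    # of two equally balanced splits, keep the later one, i.e. put the
    # character at the splitting point into the high-frequency part.
    if diff <= best_diff:
        best_split, best_diff = i, diff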
So, here are my questions:
- Is Wikipedia's description of Shannon-Fano coding really correct and complete? If this is the case, then Shannon-Fano coding is ambiguous.
- Or did Fano add another step in his paper that is missing from Wikipedia's description? If this is the case: how did Fano solve the problem described here? Which of the two versions is compatible with Fano's original description?
2 Answers
It is true that there is an ambiguity in the algorithm. I could not find any hint about it in either the Wikipedia page or the algorithm part of the original paper.
In practice, why do you get a solution with degraded performance? Because in possibility 2 you get two buckets with frequencies 3 and 1, i.e. rather different frequencies.
This issue is addressed on page 8 of the paper (emphasis mine):
But for Fano, this problem is not so important. His ambition is not to define a very simple and practical algorithm for compressing little messages consisting of a few characters. He is more interested in the theoretical aspects. For that, he considers very long individual messages (these individual messages are the characters in your example):
Under this hypothesis, the phenomenon you observe is unlikely to happen with a significant performance degradation.
To directly answer your question: without further elaboration about how to break ties, two different implementations of Shannon-Fano can produce different codes of different lengths for the same input.
As @MattTimmermans noted in the comments, Shannon-Fano does not always produce optimal prefix-free codes the way that, say, Huffman coding does. It might therefore be helpful to think of it less as an algorithm and more as a heuristic: something that will likely produce a good code but isn't guaranteed to give an optimal solution. Many heuristics suffer from similar issues, where minor tweaks to the input, or to how ties are broken, can lead to different results. A good example of this is the greedy coloring algorithm for finding vertex colorings of graphs. The linked Wikipedia article includes an example in which changing the order in which the same basic algorithm visits the nodes yields wildly different results.
Even algorithms that produce optimal results, however, can sometimes produce different optimal results based on tie-breaks. Take Huffman coding, for example, which works by repeatedly finding the two lowest-weight trees assembled so far and merging them together. If at some intermediate step three or more trees are tied for the same weight, different implementations of Huffman coding can produce different prefix-free codes depending on which two trees they join together. The resulting trees would all be equally "good", though, in that they'd all produce outputs of the same length. (That's largely because, unlike Shannon-Fano, Huffman coding is guaranteed to produce an optimal encoding.)
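For comparison, here is a minimal Huffman coding sketch (my own, not from Huffman's paper) using Python's heapq. A running counter is put into each heap entry as an explicit tie-break, so the result is deterministic; as described above, a different tie-break could yield different codes, but always with the same total length:

    import heapq
    from collections import Counter

    def huffman(freqs):
        # freqs: dict mapping char -> frequency. Returns char -> bit string.
        # Heap entries are (weight, counter, tree); the counter decides ties
        # deterministically when several trees have the same weight.
        heap = [(f, i, c) for i, (c, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)   # the two lowest-weight trees
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, count, (t1, t2)))
            count += 1
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):       # internal node: two subtrees
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:                             # leaf: a single character
                codes[tree] = prefix
        walk(heap[0][2], "")
        return codes

    print(huffman(Counter("aaabcde")))   # e.g. {'a': '0', 'b': '100', ...}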
That being said, it's easy to adjust Shannon-Fano so that it always produces a consistent result. For example, you could say "in the event of a tie, choose the partition that puts fewer items into the top group," at which point you would always consistently produce the same coding. It wouldn't necessarily be an optimal encoding, but, then again, since Shannon-Fano was never guaranteed to do so, this is probably not a major concern.
If, on the other hand, you're interested in the question of "when Shannon-Fano has to break a tie, how do I decide how to break the tie to produce the optimal solution?," then I'm not sure of a way to do this other than recursively trying both options and seeing which one is better, which in the worst case leads to exponentially-slow runtimes. But perhaps someone else here can find a way to do that!
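To make that last idea concrete, here is a rough sketch of the brute-force approach (my own; exponential in the worst case): whenever several splits are tied, recurse into all of them and keep the alternative whose complete code is shortest for the message:

    from collections import Counter

    def best_shannon_fano(pairs):
        # pairs: list of (char, freq), sorted by descending frequency.
        # Returns (encoded_length_in_bits, codes), minimized over all
        # ways of breaking ties in the splitting step.
        if len(pairs) == 1:
            return 0, {pairs[0][0]: ""}
        total = sum(f for _, f in pairs)
        # Collect the balance of *every* split point.
        prefix, diffs = 0, []
        for i in range(1, len(pairs)):
            prefix += pairs[i - 1][1]
            diffs.append((abs(prefix - (total - prefix)), i))
        best_diff = min(d for d, _ in diffs)
        best = None
        for d, i in diffs:
            if d != best_diff:
                continue                      # only recurse into tied splits
            bits_top, codes_top = best_shannon_fano(pairs[:i])
            bits_bot, codes_bot = best_shannon_fano(pairs[i:])
            # Prepending this round's bit costs one bit per character
            # occurrence, i.e. `total` bits on top of both sub-solutions.
            cost = bits_top + bits_bot + total
            if best is None or cost < best[0]:
                codes = {c: "0" + b for c, b in codes_top.items()}
                codes.update({c: "1" + b for c, b in codes_bot.items()})
                best = (cost, codes)
        return best

    pairs = sorted(Counter("aaabcde").items(), key=lambda kv: -kv[1])
    print(best_shannon_fano(pairs))   # (15, {'a': '0', 'b': '100', ...})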