Relationship between entropy and lossless compression rate
From Shannon's Source Coding Theorem we know that the expected length of a compressed string is bounded by the entropy of the original string, like so:
H(X) <= L < H(X) + 1/N
where H(X) is the entropy of the source, N is the length of the source string (in symbols), and L is the expected length of the compressed string per source symbol.
This necessarily means that there is a limit to lossless compression.
What I'd like to know is:
Can we directly relate entropy to some expected compression ratio?
Can we use the entropy to find some upper bound for the compression ratio?
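To make the question concrete, here is a small Python sketch of what I mean by "compression ratio", using a made-up 4-symbol source and an assumed 8-bit-per-symbol baseline encoding (both are just placeholders for the sake of the example):

```python
from math import log2

# Assumed toy source: 4 symbols with a skewed distribution (placeholder example).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Entropy H(X) in bits per source symbol.
H = -sum(p * log2(p) for p in probs.values())

N = 1000                  # length of the source string (number of symbols)
bits_per_symbol_raw = 8   # assumed fixed-length baseline encoding

# Source coding theorem: H <= L < H + 1/N, with L in expected bits per symbol.
L_low, L_high = H, H + 1 / N

print(f"H(X) = {H:.3f} bits/symbol")
print(f"best possible compression ratio  ~ {L_low / bits_per_symbol_raw:.3f}")
print(f"optimal block code achieves ratio < {L_high / bits_per_symbol_raw:.3f}")
```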
Shannon's Theorem is defined in terms of random data and probabilities. Similarly, the entropy of a string is only defined for random strings -- the entropy is a property of the distribution, not of the strings themselves. So, we can restate Shannon's Theorem informally as: if you pick a string at random according to a given distribution, no compression algorithm can make the compressed string shorter, on average, than the entropy rate of that distribution.
Given any random string, I can easily write a compression algorithm which will compress that string down to 1 bit, but my algorithm will necessarily increase the length of some other strings. My compression algorithm works as follows:
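A minimal Python sketch of such a scheme (the favoured string, here hard-coded as SPECIAL, and the bit-string representation are placeholders chosen purely for illustration):

```python
# Sketch only: SPECIAL is the one arbitrarily chosen string the algorithm favours,
# and strings are represented as Python bit-strings such as "0110".
SPECIAL = "0110"

def compress(s: str) -> str:
    if s == SPECIAL:
        return "1"       # the favoured string compresses to a single bit
    return "0" + s       # every other string grows by exactly one bit
```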
The corresponding decompression algorithm is:
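Continuing the same sketch:

```python
def decompress(c: str) -> str:
    if c == "1":
        return SPECIAL   # the single bit maps back to the favoured string
    return c[1:]         # otherwise strip the leading marker bit
```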
The key here is that we can't write down one algorithm which, for all strings from a given distribution, compresses them all at a high rate on average. There are just too many strings.
If we have a given probability distribution of strings, we can calculate the entropy rate of the distribution, and then, if we randomly pick a string according to the distribution and attempt to compress it using any algorithm, the relative size of the compressed string will, on average, never be less than the entropy rate. This is what Shannon's Theorem says.
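As an illustrative experiment (not part of the theorem; the choice of source and compressor is arbitrary), you can sample a long string from a known distribution and check that a general-purpose compressor such as zlib does not beat the entropy rate on average:

```python
import random, zlib
from math import log2

# Illustrative experiment only: a memoryless binary source with P(1) = 0.1.
p = 0.1
entropy_rate = -(p * log2(p) + (1 - p) * log2(1 - p))   # bits per source symbol

random.seed(0)
n_symbols = 100_000
bits = "".join("1" if random.random() < p else "0" for _ in range(n_symbols))

# Pack the bits 8-per-byte so zlib sees the raw source, not ASCII '0'/'1' characters.
packed = int(bits, 2).to_bytes((n_symbols + 7) // 8, "big")
compressed = zlib.compress(packed, level=9)

print(f"entropy rate : {entropy_rate:.3f} bits/symbol")
print(f"zlib achieves: {8 * len(compressed) / n_symbols:.3f} bits/symbol")
```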
Yes. The entropy rate of the English language is often cited as 1.5 bits per character (give or take). Typical encodings use 8 bits per character. So a maximally compressed text should be 1.5/8 (~19%) of the size of the original. Actual results for a plain text version of Jane Austen's Pride and Prejudice: orig = 701K, bzip2 = 178K, or ~25%.
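Spelling out the arithmetic behind those figures (the 1.5 bits/character estimate is itself only approximate):

```python
# Numbers quoted above; the 1.5 bits/char entropy estimate for English is rough.
entropy_rate = 1.5    # bits per character
encoding = 8          # bits per character in plain ASCII text

theoretical_ratio = entropy_rate / encoding
actual_ratio = 178 / 701            # bzip2 size / original size, in KB

print(f"theoretical best ratio        ~ {theoretical_ratio:.1%}")   # ~18.8%
print(f"bzip2 on Pride and Prejudice  ~ {actual_ratio:.1%}")        # ~25.4%
```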
You can't directly relate entropy to the compression ratio without knowing the length of the source string, but you can see the theoretical limit on the maximum compression ratio by solving for the smallest possible value of L. You can use this limit as a metric for the efficiency of your compression algorithms, although a poor value of that metric doesn't mean that a better algorithm has been discovered, or even exists.
So, yes. You can use entropy to find the theoretical maximum lossless compression ratio, but no, you can't use it to determine your expected compression ratio for any given compression algorithm.
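A sketch of how one might turn that limit into an efficiency metric (the function names and the 8-bit baseline are my own choices, not anything standard):

```python
def max_lossless_ratio(entropy_per_symbol: float, bits_per_symbol_raw: float = 8.0) -> float:
    """Theoretical best-case compressed size / original size.

    The smallest possible expected code length L is the entropy itself
    (H(X) <= L), so the ratio can never fall below H(X) / raw bits.
    """
    return entropy_per_symbol / bits_per_symbol_raw

def efficiency(actual_ratio: float, entropy_per_symbol: float, bits_per_symbol_raw: float = 8.0) -> float:
    """How close a real compressor got to the theoretical limit (1.0 = optimal)."""
    return max_lossless_ratio(entropy_per_symbol, bits_per_symbol_raw) / actual_ratio

# Example with the English-text figures from the other answer:
print(max_lossless_ratio(1.5))       # ~0.19
print(efficiency(178 / 701, 1.5))    # ~0.74
```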
Yes! I think this paper would point you in the right direction.
ETA: Looks like you need to be an IEEE member to read the actual paper. If someone could find a publicly available resource (or explain the math here), that would of course be much better!