What is the maximum theoretically possible compression rate?
This is a theoretical question, so expect that many details here are not computable in practice, or even in theory.
Let's say I have a string s that I want to compress. The result should be a self-extracting binary (it can be x86 assembler, or some other hypothetical Turing-complete low-level language) that outputs s.
Now we can enumerate all possible such binaries and programs, ordered by size. Let B_s be the sub-list of these binaries that output s (of course, B_s is uncomputable).
Since every non-empty set of positive integers has a minimum, there must be a smallest program b_min_s in B_s.
For which languages (i.e. sets of strings) do we know something about the size of b_min_s? Maybe only an estimate. (I can construct some trivial examples where I can even calculate B_s and b_min_s, but I am interested in more interesting languages.)
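As a rough illustration of the "trivial examples" mentioned above, here is a minimal sketch of the enumeration in a toy language that is deliberately not Turing-complete, so every program halts and the search is computable. The language itself (letters append themselves, a digit repeats the output so far), and the names `run` and `shortest_program`, are invented for this example and are not part of the question.

```python
from itertools import product

# Hypothetical toy language: a program is a string over these symbols.
LETTERS = "ab"   # a letter appends itself to the output
DIGITS = "23"    # a digit d replaces the output so far with d copies of it
ALPHABET = LETTERS + DIGITS


def run(program: str) -> str:
    """Interpret a toy program and return its output (always halts)."""
    out = ""
    for ch in program:
        if ch in DIGITS:
            out = out * int(ch)
        else:
            out += ch
    return out


def shortest_program(s: str) -> str:
    """Enumerate programs by increasing size and return the first one that outputs s."""
    if any(ch not in LETTERS for ch in s):
        raise ValueError("s must consist only of the letters in LETTERS")
    # s is itself a program that outputs s, so the search never needs size > len(s).
    for size in range(len(s) + 1):
        for symbols in product(ALPHABET, repeat=size):
            program = "".join(symbols)
            if run(program) == s:
                return program
    raise AssertionError("unreachable: s itself is a program that outputs s")


if __name__ == "__main__":
    print(shortest_program("abababab"))
```

The search terminates only because every toy program halts. For a Turing-complete language the same loop would get stuck on non-halting candidates, which is where the uncomputability of B_s comes from.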
Comments (4)
This is Kolmogorov complexity, and you are correct that it is not computable. If it were, you could write a paradoxical program of length n that prints a string whose Kolmogorov complexity is m > n.
Clearly, you can bound b_min_s for given inputs. However, as far as I know, most efforts to do so have been existence proofs. For instance, there is an ongoing competition to compress English Wikipedia.
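One way to read "you can bound b_min_s" is as an upper bound: any concrete compressor plus a fixed-size decompressor stub is a valid self-extracting program, so its size can never be smaller than b_min_s. A minimal sketch using Python's `zlib` follows. The function name `upper_bound_bytes` and the `decoder_overhead` parameter are assumptions made for illustration; the real size of a decompressor stub depends on the target language.

```python
import zlib


def upper_bound_bytes(s: bytes, decoder_overhead: int = 0) -> int:
    """Return an upper bound (in bytes) on the size of a program that outputs s.

    Two candidate programs are considered: one that stores s literally and one
    that stores the zlib-compressed form of s. decoder_overhead is a
    hypothetical constant standing in for the size of the print/decompress stub.
    """
    literal_size = len(s)
    compressed_size = len(zlib.compress(s, 9))
    return min(literal_size, compressed_size) + decoder_overhead


if __name__ == "__main__":
    repetitive = b"a" * 10_000
    print(len(repetitive), upper_bound_bytes(repetitive))
```

This gives only an upper bound. A small value shows that s is compressible, but a large value proves nothing, since a better program may exist that zlib cannot find.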
Claude Shannon estimated the information density of the English language to be somewhere between 0.6 and 1.3 bits per character in his 1951 paper "Prediction and Entropy of Printed English" (Bell System Technical Journal 30(1), pp. 50-64).
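For a feel of what "bits per character" means, here is a minimal sketch that computes the order-0 (single-character frequency) entropy of a text. This is not Shannon's method: his estimate relied on human prediction with long context, so a frequency-only figure like this one typically comes out well above 1.3 bits per character on English text. The function name is made up for this example.

```python
import math
from collections import Counter


def order0_entropy_bits_per_char(text: str) -> float:
    """Empirical entropy of the character distribution of text, in bits per character."""
    if not text:
        raise ValueError("text must be non-empty")
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character frequencies p.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    print(order0_entropy_bits_per_char(sample))
```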
The maximal (average) compression rate possible is 1:1.
The number of possible inputs is equal to the number of outputs.
It has to be, in order to be able to map the output back to the input.
To be able to store the output, you need a container of the same size as the minimal container for the input, giving a 1:1 compression rate.
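The counting argument above can be checked with a little arithmetic. The sketch below assumes inputs are bit strings of a fixed length n (a simplification not stated in the answer), and the function names are invented for illustration: there are 2^n inputs but only 2^n - 1 strictly shorter bit strings, so no lossless scheme can shorten every input.

```python
def count_strings_of_length(n: int) -> int:
    """Number of distinct bit strings of exactly n bits."""
    return 2 ** n


def count_shorter_strings(n: int) -> int:
    """Number of distinct bit strings of length 0 .. n-1 (one fewer than 2**n)."""
    return 2 ** n - 1


def max_fraction_compressible(n: int, k: int) -> float:
    """Upper bound on the fraction of n-bit strings that can shrink by at least k bits.

    A lossless scheme maps distinct inputs to distinct outputs, so at most as many
    inputs can shrink by k bits as there are bit strings of length <= n - k.
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    available_outputs = 2 ** (n - k + 1) - 1  # bit strings of length 0 .. n-k
    return min(1.0, available_outputs / 2 ** n)


if __name__ == "__main__":
    n = 16
    print(count_strings_of_length(n), count_shorter_strings(n))
    print(max_fraction_compressible(n, 8))
```

So, under these assumptions, fewer than 1 in 2^(k-1) inputs can be shortened by k or more bits, whichever compressor is used.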
Basically, you need enough information to rebuild your original information. I guess the other answers are more helpful for your theoretical discussion, but just keep this in mind.