实际压缩和柯尔莫哥洛夫复杂度的界限
我正在研究使用压缩作为衡量文档与文档语料库的关系的一种方法。这样做时,我在使用 bzip2 时发现了一个奇怪的结果; len(压缩(语料库))> len(压缩(语料库+新文档))。实际的压缩算法是否应该出现这种情况?在考虑数据的柯尔莫哥洛夫复杂度时,这在理论上是否可能? (想法是使用压缩算法来近似数据的复杂度)
I'm looking into using compression as a way to measure the relation of a document to a corpus of documents. In doing so I've found a strange result when using bzip2; len(compress(corpus)) > len(compress(corpus + new_document)). Should this be the case with a practical compression algorithm and is this theoretically possible when looking at the Kolmogorov complexity of data? (the idea is to use a compression algorithm to approximate the complexity of the data)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
是的,实际的压缩算法应该是这种情况,并且理论上可以使用 Kolmogorov 复杂度 。解释原因的最简单方法是举个例子。
假设以下内容:
,
|
(这是 的变体游程编码)然后:
所以
compress(corpus)
比compress(corpus+new_document)
长。这有点做作,但希望能够解释理论上如何通过简单的方案出现结果。我并不是说 bzip2 会发生这种情况,只是说明它在理论上是如何可能的。编辑
在另一个答案中提到,游程编码不是图灵完备的,因此不能用于柯尔莫哥洛夫复杂度。虽然这是事实,但使用图灵语言,您可以以您选择使用的任何描述语言实现游程长度编码,并获得相同的结果,因此该示例仍然有效。
Yes, it should be the case with a practical compression algorithm, and is theoretically possible with Kolmogorov complexity. The easiest way to explain why is with an example.
Suppose the following:
,
|
(this is a variant of run-length encoding)Then:
So
compress(corpus)
is longer thancompress(corpus+new_document)
. It's a bit contrived, but hopefully explains how the result could theoretically appear with a simple scheme. I'm not saying this is what happens with bzip2, just showing how it is theoretically possible.Edit
It has been mentioned in another answer that run-length encoding is not Turing complete and so cannot be used for Kolmogorov complexity. While this is true, using a Turing language you can implement an encoding of runlength in whatever description language you choose to use, with the same result, so the example still holds valid.
现实生活中的压缩算法有这样的怪癖,但无论如何它们只能提供非常粗略的近似值。
至于理论上是否会发生,有可能,但差别并不大。
假设您有两个字符串 x 和 y,其中 x 是 y 的前缀。例如,
x = "asdfasdfasdfasdfasdfasdfasdfasdfasdf"
y = "asdfasdfasdfasdfasdfasdfasdf23452345234523452344523452452345234524345234"
让我们进一步假设 D 是y 的最短描述。 (即 K(y) = |D|)
在这种情况下,x 可以描述为 |“D 描述的数字减去 46 个字符”|,它比 D 长,但只长了一个小常数和一个对数因子(基本上其余指令中的字符数)。
甚至可能有对 x 的更短的描述,但我们知道在最坏的情况下, K(x) <= K(y)+log(|y|-|x|)
但是,您必须记住,理论上的 Kolmogorov 复杂度是不可计算的,并且持续的差异在这个领域没有任何意义。
(注:上面的 RLE 示例无效,因为 RLE 不是图灵完备语言,因此它不能用作 Kolmogorov 复杂性的描述语言。)
Real-life compression algorithms have quirks like this but they only provide for a very crude approximation anyway.
And as for whether it can happen in theory, probably, but the difference isn't significant.
Let's assume you've got two strings, x and y, where x is a prefix of y. Let's say for example that
x = "asdfasdfasdfasdfasdfasdfasdfasdfasdf"
y = "asdfasdfasdfasdfasdfasdfasdfasdfasdf23452345234523452344523452452345234524345234"
Let's assume furthermore that D is the shortest description of y. (I.e. K(y) = |D|)
In this case, x can be described as |"the number described by D minus 46 characters"|, which is longer than D but only by a small constant and a logarithmic factor (the number of characters in the rest of the instructions basically).
There might even be a shorter description of x but we know that at worst, K(x) <= K(y)+log(|y|-|x|)
However, you have to bear in mind that theoretical Kolmogorov complexity is incomputable and constant differences mean nothing in this field.
(N.b.: The RLE example above isn't valid as RLE isn't a Turing complete language, therefore it can't be used as a description language for Kolmogorov complexity.)