实际压缩和柯尔莫哥洛夫复杂度的界限

发布于 2024-10-11 19:12:59 字数 166 浏览 2 评论 0原文

我正在研究使用压缩作为衡量文档与文档语料库的关系的一种方法。这样做时,我在使用 bzip2 时发现了一个奇怪的结果; len(压缩(语料库))> len(压缩(语料库+新文档))。实际的压缩算法是否应该出现这种情况?在考虑数据的柯尔莫哥洛夫复杂度时,这在理论上是否可能? (想法是使用压缩算法来近似数据的复杂度)

I'm looking into using compression as a way to measure the relation of a document to a corpus of documents. In doing so I've found a strange result when using bzip2; len(compress(corpus)) > len(compress(corpus + new_document)). Should this be the case with a practical compression algorithm and is this theoretically possible when looking at the Kolmogorov complexity of data? (the idea is to use a compression algorithm to approximate the complexity of the data)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

乱了心跳 2024-10-18 19:12:59

是的,实际的压缩算法应该是这种情况,并且理论上可以使用 Kolmogorov 复杂度 。解释原因的最简单方法是举个例子。

假设以下内容:

  • 您的文档分隔符是
  • 语料库是文档 abc,def,abc,def,abc,def,abc,
  • 新文档是 def
  • 您的压缩算法(或柯尔莫哥洛夫描述语言)只允许重复通过使用重复计数作为前缀,后跟 | (这是 的变体游程编码)

然后:

  • compress(corpus) = "3|abc,def,1|abc"
  • compress(corpus+new_document) = "4|abc,def,"

所以 compress(corpus)compress(corpus+new_document) 长。这有点做作,但希望能够解释理论上如何通过简单的方案出现结果。我并不是说 bzip2 会发生这种情况,只是说明它在理论上是如何可能的。

编辑
在另一个答案中提到,游程编码不是图灵完备的,因此不能用于柯尔莫哥洛夫复杂度。虽然这是事实,但使用图灵语言,您可以以您选择使用的任何描述语言实现游程长度编码,并获得相同的结果,因此该示例仍然有效。

Yes, it should be the case with a practical compression algorithm, and is theoretically possible with Kolmogorov complexity. The easiest way to explain why is with an example.

Suppose the following:

  • your document separator character is ,
  • corpus is documents abc,def,abc,def,abc,def,abc,
  • new document is def
  • your compression algorithm (or kolmogorov description language) just allows repetition by prefixing with a repeat count followed by | (this is a variant of run-length encoding)

Then:

  • compress(corpus) = "3|abc,def,1|abc"
  • compress(corpus+new_document) = "4|abc,def,"

So compress(corpus) is longer than compress(corpus+new_document). It's a bit contrived, but hopefully explains how the result could theoretically appear with a simple scheme. I'm not saying this is what happens with bzip2, just showing how it is theoretically possible.

Edit
It has been mentioned in another answer that run-length encoding is not Turing complete and so cannot be used for Kolmogorov complexity. While this is true, using a Turing language you can implement an encoding of runlength in whatever description language you choose to use, with the same result, so the example still holds valid.

坐在坟头思考人生 2024-10-18 19:12:59

现实生活中的压缩算法有这样的怪癖,但无论如何它们只能提供非常粗略的近似值。

至于理论上是否会发生,有可能,但差别并不大。

假设您有两个字符串 x 和 y,其中 x 是 y 的前缀。例如,

x = "asdfasdfasdfasdfasdfasdfasdfasdfasdf"

y = "asdfasdfasdfasdfasdfasdfasdf23452345234523452344523452452345234524345234"

让我们进一步假设 D 是y 的最短描述。 (即 K(y) = |D|)

在这种情况下,x 可以描述为 |“D 描述的数字减去 46 个字符”|,它比 D 长,但只长了一个小常数和一个对数因子(基本上其余指令中的字符数)。

甚至可能有对 x 的更短的描述,但我们知道在最坏的情况下, K(x) <= K(y)+log(|y|-|x|)

但是,您必须记住,理论上的 Kolmogorov 复杂度是不可计算的,并且持续的差异在这个领域没有任何意义。

(注:上面的 RLE 示例无效,因为 RLE 不是图灵完备语言,因此它不能用作 Kolmogorov 复杂性的描述语言。)

Real-life compression algorithms have quirks like this but they only provide for a very crude approximation anyway.

And as for whether it can happen in theory, probably, but the difference isn't significant.

Let's assume you've got two strings, x and y, where x is a prefix of y. Let's say for example that

x = "asdfasdfasdfasdfasdfasdfasdfasdfasdf"

y = "asdfasdfasdfasdfasdfasdfasdfasdfasdf23452345234523452344523452452345234524345234"

Let's assume furthermore that D is the shortest description of y. (I.e. K(y) = |D|)

In this case, x can be described as |"the number described by D minus 46 characters"|, which is longer than D but only by a small constant and a logarithmic factor (the number of characters in the rest of the instructions basically).

There might even be a shorter description of x but we know that at worst, K(x) <= K(y)+log(|y|-|x|)

However, you have to bear in mind that theoretical Kolmogorov complexity is incomputable and constant differences mean nothing in this field.

(N.b.: The RLE example above isn't valid as RLE isn't a Turing complete language, therefore it can't be used as a description language for Kolmogorov complexity.)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文