纯文本压缩算法的现状如何?

发布于 2024-07-07 17:01:57 字数 146 浏览 12 评论 0原文

为了纪念哈特奖, 文本压缩的顶级算法(以及每种算法的快速描述)是什么?

注意:这个问题的目的是获得压缩算法的描述,而不是压缩程序的描述。

In honor of the Hutter Prize,
what are the top algorithms (and a quick description of each) for text compression?

Note: The intent of this question is to get a description of compression algorithms, not of compression programs.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

氛圍 2024-07-14 17:01:57

突破边界的压缩器结合了疯狂结果的算法。 常见算法包括:

  • Burrows-Wheeler 变换此处 - 使用可预测的算法对字符(或其他位块)进行混洗,以增加重复块,从而使源更容易压缩。 解压缩正常进行,并且结果未通过逆变换进行混洗。 注意:BWT 本身并不真正压缩任何东西。 它只是使源代码更容易压缩。
  • 部分匹配预测 (PPM) - 算术编码,其中预测模型(上下文)是通过处理有关源的统计数据而不是使用静态概率来创建的。 尽管其根源在于算术编码,但结果可以用霍夫曼编码或字典以及算术编码来表示。
  • 上下文混合 - 算术编码使用静态上下文进行预测,PPM 动态选择单个上下文,上下文混合使用许多上下文并权衡它们的结果。 PAQ 使用上下文混合。 这里是一个高级概述。
  • 动态马尔可夫压缩 - 与 PPM 相关,但使用位级上下文而不是字节或更长。
  • 此外,Hutter 奖参赛者可以使用外部词典中的小字节条目替换普通文本,并使用特殊符号来区分大小写文本,而不是使用两个不同的条目。 这就是为什么它们非常擅长压缩文本(尤其是 ASCII 文本),但对于一般压缩却没有那么有价值。

Maximum Compression 是一个非常酷的文本和通用压缩基准网站。 Matt Mahoney 发布了另一个基准。 Mahoney 可能特别令人感兴趣,因为它列出了每个条目使用的主要算法。

The boundary-pushing compressors combine algorithms for insane results. Common algorithms include:

  • The Burrows-Wheeler Transform and here - shuffle characters (or other bit blocks) with a predictable algorithm to increase repeated blocks which makes the source easier to compress. Decompression occurs as normal and the result is un-shuffled with the reverse transform. Note: BWT alone doesn't actually compress anything. It just makes the source easier to compress.
  • Prediction by Partial Matching (PPM) - an evolution of arithmetic coding where the prediction model(context) is created by crunching statistics about the source versus using static probabilities. Even though its roots are in arithmetic coding, the result can be represented with Huffman encoding or a dictionary as well as arithmetic coding.
  • Context Mixing - Arithmetic coding uses a static context for prediction, PPM dynamically chooses a single context, Context Mixing uses many contexts and weighs their results. PAQ uses context mixing. Here's a high-level overview.
  • Dynamic Markov Compression - related to PPM but uses bit-level contexts versus byte or longer.
  • In addition, the Hutter prize contestants may replace common text with small-byte entries from external dictionaries and differentiate upper and lower case text with a special symbol versus using two distinct entries. That's why they're so good at compressing text (especially ASCII text) and not as valuable for general compression.

Maximum Compression is a pretty cool text and general compression benchmark site. Matt Mahoney publishes another benchmark. Mahoney's may be of particular interest because it lists the primary algorithm used per entry.

不及他 2024-07-14 17:01:57

总有 lzip

开个玩笑吧:

  • 在考虑兼容性的情况下,PKZIP(DEFLATE 算法)仍然获胜。
  • bzip2 是享受相对广泛的安装基础和相当好的压缩比之间的最佳折衷方案,但需要单独的存档器。
  • 7-ZipLZMA 算法)压缩效果非常好,并且可以在 LGPL 下使用。 然而,很少有操作系统附带内置支持。
  • rzip 是 bzip2 的一个变体,在我看来值得更多关注。 对于需要长期归档的巨大日志文件来说,这可能特别有趣。 它还需要一个单独的存档器。

There's always lzip.

All kidding aside:

  • Where compatibility is a concern, PKZIP (DEFLATE algorithm) still wins.
  • bzip2 is the best compromise between being enjoying a relatively broad install base and a rather good compression ratio, but requires a separate archiver.
  • 7-Zip (LZMA algorithm) compresses very well and is available for under the LGPL. Few operating systems ship with built-in support, however.
  • rzip is a variant of bzip2 that in my opinion deserves more attention. It could be particularly interesting for huge log files that need long-term archiving. It also requires a separate archiver.
对风讲故事 2024-07-14 17:01:57

如果您想将 PAQ 用作程序,您可以在基于 debian 的系统上安装 zpaq 软件包。 用法是(另请参阅man zpaq

zpaq c archivename.zpaq file1 file2 file3

压缩到大约zip 文件大小的1/10。 (1.9M 与 15M)

If you want to use PAQ as a program, you can install the zpaq package on debian-based systems. Usage is (see also man zpaq)

zpaq c archivename.zpaq file1 file2 file3

Compression was to about 1/10th of a zip file's size. (1.9M vs 15M)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文