I need to choose a compression algorithm

I need to choose a compression algorithm to compress some data. I don't know the type of data I'll be compressing in advance (think of it as kinda like the WinRAR program).

I've heard of the following algorithms but I don't know which one I should use. Can anyone post a short list of pros and cons? For my application the first priority is decompression speed; the second priority is space saved. Compression (not decompression) speed is irrelevant.

  • Deflate
  • Implode
  • Plain Huffman
  • bzip2
  • lzma

Comments (6)

心安伴我暖 2024-08-30 18:02:32

I ran a few benchmarks compressing a .tar that contained a mix of high entropy data and text. These are the results:

Name  - Compression rate* - Decompression Time
7zip  - 87.8%             - 0.703s
bzip2 - 80.3%             - 1.661s
gzip  - 72.9%             - 0.347s
lzo   - 70.0%             - 0.111s

*Higher is better

From this I came to the conclusion that the compression rate of an algorithm depends on its name; the first in alphabetical order will be the one with the best compression rate, and so on.

Therefore I decided to rename lzo to 1lzo. Now I have the best algorithm ever.


EDIT: it's worth noting that, of all of these, lzo is unfortunately the only one with a very restrictive license (GPL) :(
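
For reference, a rough sketch of how this kind of measurement could be reproduced with Python's standard-library codecs (zlib, bz2 and lzma as stand-ins, since 7zip and lzo have no stdlib bindings). The input path is a placeholder, and the printed numbers will not match the table above.

    # Benchmark sketch: compress once, then time only decompression,
    # since that is the asker's first priority.
    import bz2
    import lzma
    import time
    import zlib

    with open("sample.tar", "rb") as f:  # placeholder mixed-content archive
        data = f.read()

    for name, compress, decompress in [
        ("zlib (deflate)", lambda d: zlib.compress(d, 9), zlib.decompress),
        ("bz2",            lambda d: bz2.compress(d, 9),  bz2.decompress),
        ("lzma (xz)",      lzma.compress,                 lzma.decompress),
    ]:
        blob = compress(data)
        start = time.perf_counter()
        restored = decompress(blob)
        elapsed = time.perf_counter() - start
        assert restored == data
        saved = 100.0 * (1 - len(blob) / len(data))  # "compression rate": % of space saved
        print(f"{name:14s} {saved:5.1f}% saved   decompression {elapsed:.3f}s")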

落叶缤纷 2024-08-30 18:02:32

If you need high decompression speed then you should be using LZO. Its compression speed and ratio are decent, but it's hard to beat its decompression speed.

倒带 2024-08-30 18:02:32

In the Linux kernel it is explained well (for the algorithms included there):

  • Deflate (gzip) - Fast, worst compression
  • bzip2 - Slow, middle compression
  • lzma - Very slow compression, fast decompression (though slower than gzip), best compression

I haven't used the others, so it is hard to say, but the speed of an algorithm may depend largely on the architecture. For example, there are studies showing that compressing data on an HDD speeds up I/O, because the processor is so much faster than the disk that the extra work is worth it. However, it depends largely on where the bottleneck is.

Similarly, one algorithm may use memory extensively, which may or may not cause problems (12 MiB -- is that a lot or a little? On embedded systems it is a lot; on a modern x86 it is a tiny fraction of memory).
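
To make the memory point concrete, here is a minimal sketch using Python's standard-library lzma module: LZMA2 decoder memory is dominated by the dictionary size chosen at compression time, and a memlimit lets a constrained consumer refuse streams it cannot afford. The dictionary sizes and limit below are arbitrary illustration values.

    import lzma
    import os

    data = os.urandom(1 << 20) + b"some repetitive text " * 100_000  # mixed sample

    # A small dictionary keeps decompression memory low (at some cost in ratio);
    # a large one does the opposite.
    small_dict = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": 1 << 20}]   # 1 MiB
    large_dict = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": 64 << 20}]  # 64 MiB

    for name, filters in [("1 MiB dict", small_dict), ("64 MiB dict", large_dict)]:
        blob = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)
        # memlimit makes the decoder raise LZMAError if the stream would need
        # more RAM than the consumer can spare.
        out = lzma.decompress(blob, memlimit=256 << 20)
        assert out == data
        print(f"{name}: {len(blob)} bytes compressed")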

半世蒼涼 2024-08-30 18:02:32

Take a look at 7zip. It's open source and contains 7 separate compression methods. Some minor testing we've done shows the 7z format gives a much smaller result file than zip and it was also faster for the sample data we used.

Since our standard compression is zip, we didn't look at other compression methods yet.
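
As a quick way to try that comparison yourself, a small sketch along these lines could work, assuming the 7z command-line tool (7-Zip/p7zip) is on PATH; sample.tar is a placeholder input.

    import os
    import subprocess
    import zipfile

    sample = "sample.tar"  # placeholder input file

    # zip via the standard library (Deflate)
    with zipfile.ZipFile("sample.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(sample)

    # 7z via its CLI ("a" adds files to an archive; LZMA/LZMA2 by default)
    subprocess.run(["7z", "a", "sample.7z", sample],
                   check=True, stdout=subprocess.DEVNULL)

    for path in ("sample.zip", "sample.7z"):
        print(path, os.path.getsize(path), "bytes")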

拥醉 2024-08-30 18:02:32

For a comprehensive benchmark on text data you might want to check out the Large Text Compression Benchmark.

For other types, this might be indicative.

稚气少女 2024-08-30 18:02:32

One of the fastest compression algorithms these days is LZ4, reportedly reaching RAM speed limits during decompression.

On the other hand, an algorithm generally providing the best compression ratios is LZMA2, used by xz and 7z. However, two caveats:

A good balance is provided by Zstandard, which is fast but can also provide ratios competitive with LZMA.

Another popular option nowadays is Brotli, which is more focused on speed than on achieving the highest compression ratio. Support for both the Zstd and Brotli Content-Encodings has recently been added to the HTTP protocol.

The winner in benchmarks used to be PAQ; however, it isn't widely used, and I couldn't find an actively maintained implementation of it.
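
A small decompression-timing sketch for these newer codecs, assuming the third-party Python packages zstandard, lz4 and brotli are installed (pip install zstandard lz4 brotli); the payload and settings are arbitrary, so treat any numbers as indicative only.

    import time
    import brotli
    import lz4.frame
    import zstandard

    payload = b"mixed text and structure " * 200_000  # placeholder sample data

    zc = zstandard.ZstdCompressor(level=19)
    zd = zstandard.ZstdDecompressor()

    codecs = {
        "zstd":   (zc.compress, zd.decompress),
        "lz4":    (lz4.frame.compress, lz4.frame.decompress),
        "brotli": (lambda d: brotli.compress(d, quality=9), brotli.decompress),
    }

    for name, (compress, decompress) in codecs.items():
        blob = compress(payload)
        start = time.perf_counter()
        out = decompress(blob)
        elapsed = time.perf_counter() - start
        assert out == payload
        ratio = len(payload) / len(blob)
        print(f"{name:6s} ratio={ratio:7.1f}x  decompress={elapsed * 1000:.2f} ms")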
