当前位置：文江博客话题详情

zip 压缩背后的概念是什么？

发布于 2024-08-04 02:28:25 字数 95 浏览 3 评论 0原文

zip 压缩背后的概念是什么？我可以理解删除空白空间等的概念，但大概必须添加一些内容来说明在解压过程中需要添加多少/何处可用空间？

压缩字节流的基本过程是什么？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

踏雪无痕 2024-08-11 02:28:25

一个好的起点是查找 Huffman 压缩方案。霍夫曼背后的基本思想是，在给定文件中，某些字节比其他字节出现得更频繁（在纯文本文件中，许多字节根本不会出现）。与其花费 8 位来编码每个字节，为什么不使用较短的位序列来编码最常见的字符，并使用较长的序列来编码不太常见的字符（这些序列是通过创建哈夫曼树来确定的）。

一旦您掌握了使用这些树根据字符频率对文件进行编码/解码，想象一下您然后开始处理词频 - 而不是将“它们”编码为 4 个字符的序列，为什么不将其视为单个字符由于其频率而具有特征，允许它在霍夫曼树中分配自己的叶子。这或多或少是 ZIP 和其他无损类型压缩的基础 - 它们在文件中查找常见的“单词”（字节序列）（如果足够常见，则包括仅 1 个字节的序列）并使用树对它们进行编码。然后，zip 文件只需要包含树信息（每个序列的副本及其出现的次数），即可重建树并解码文件的其余部分。

跟进：

为了更好地回答原来的问题，无损压缩背后的想法与其说是删除空白，不如说是删除冗余信息。

如果您创建一个数据库来存储音乐歌词，您会发现大量空间被用来存储重复多次的合唱。您可以简单地将 CHORUS 一词放在合唱线的第一个实例之前，而不是使用所有空间，然后每次重复合唱时，只需使用 CHORUS 作为占位符（事实上，这几乎就是这个想法LZW 压缩之后 - 在 LZW 中，歌曲的每一行前面都会显示一个数字，如果一行在歌曲中稍后重复，则不要写出整行，只显示数字）。

回复收藏 0 原文

网名女生简单气质 2024-08-11 02:28:25

基本概念是，不使用八位来表示每个字节，而是使用更短的表示形式来表示更频繁出现的字节或字节序列。

例如，如果您的文件仅包含重复十六次的字节 0x41 (A)，则不要将其表示为 8 位序列 01000001，而是将其缩短为 1 -位序列0。那么该文件可以用0000000000000000（十六个0）来表示。因此，重复十六次的字节0x41的文件可以由重复两次的字节0x00组成的文件来表示。

因此，对于此文件（0x41 重复十六次），位 01000001 不会通过位 0 传达任何其他信息>。因此，在这种情况下，我们丢弃无关的位以获得更短的表示。

这就是压缩背后的核心思想。

作为另一个示例，考虑 8 字节模式

0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

，现在重复它 2048 次。遵循上述方法的一种方法是使用三位来表示字节。

现在我们可以用 00000101 00111001 01110111（这是三字节模式 0x05 0x39 0x77）重复 2048 次来表示上述字节模式。

但更好的方法是

0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

用单个位 0 来表示字节模式。然后我们可以用 0 重复 2048 次来表示上面的字节模式，变成字节 0x00 重复 256 次。现在我们只需要存储字典

0 -> 0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

和字节模式 0x00 重复 256 次，我们将文件从 16,384 字节压缩到（以字典为模）256 字节。

简而言之，这就是压缩的工作原理。整个工作归结为在给定文件中找到字节和字节序列的简短、有效的表示。这是一个简单的想法，但细节（找到表示）可能相当具有挑战性。

例如，请参见：

The basic concept is that instead of using eight bits to represent each byte, you use shorter representations for more frequently occuring bytes or sequences of bytes.

For example, if your file consists solely of the byte 0x41 (A) repeated sixteen times, then instead of representing it as the 8-bit sequence 01000001 shorten it to the 1-bit sequence 0. Then the file can be represented by 0000000000000000 (sixteen 0s). So then the file of the byte 0x41 repeated sixteen times can be represented by the file consisting of the byte 0x00 repeated twice.

So what we have here is that for this file (0x41 repeated sixteen times) the bits 01000001 don't convey any additional information over the bit 0. So, in this case, we throw away the extraneous bits to obtain a shorter representation.

That is the core idea behind compression.

As another example, consider the eight byte pattern

0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

and now repeat it 2048 times. One way to follow the approach above is to represent bytes using three bits.

Now we can represent the above byte pattern by 00000101 00111001 01110111 (this is the three-byte pattern 0x05 0x39 0x77) repeated 2048 times.

But an even better approach is to represent the byte pattern

0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

by the single bit 0. Then we can represent the above byte pattern by 0 repeated 2048 times which becomes the byte 0x00 repeated 256 times. Now we only need to store the dictionary

0 -> 0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

and the byte pattern 0x00 repeated 256 times and we compressed the file from 16,384 bytes to (modulo the dictionary) 256 bytes.

That, in a nutshell is how compression works. The whole business comes down to finding short, efficient representations of the bytes and byte sequences in a given file. That's the simple idea, but the details (finding the representation) can be quite challenging.

See for example:

回复收藏 0 原文