zip 压缩背后的概念是什么?
zip 压缩背后的概念是什么?我可以理解删除空白空间等的概念,但大概必须添加一些内容来说明在解压过程中需要添加多少/何处可用空间?
压缩字节流的基本过程是什么?
What's the concept behind zip compression? I can understand the concept of removing empty space etc, but presumably something has to be added to say how much/where that free space needs to be added back in during decompression?
What's the basic process for compressing a stream of bytes?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
一个好的起点是查找 Huffman 压缩方案。霍夫曼背后的基本思想是,在给定文件中,某些字节比其他字节出现得更频繁(在纯文本文件中,许多字节根本不会出现)。与其花费 8 位来编码每个字节,为什么不使用较短的位序列来编码最常见的字符,并使用较长的序列来编码不太常见的字符(这些序列是通过创建哈夫曼树来确定的)。
一旦您掌握了使用这些树根据字符频率对文件进行编码/解码,想象一下您然后开始处理词频 - 而不是将“它们”编码为 4 个字符的序列,为什么不将其视为单个字符由于其频率而具有特征,允许它在霍夫曼树中分配自己的叶子。这或多或少是 ZIP 和其他无损类型压缩的基础 - 它们在文件中查找常见的“单词”(字节序列)(如果足够常见,则包括仅 1 个字节的序列)并使用树对它们进行编码。然后,zip 文件只需要包含树信息(每个序列的副本及其出现的次数),即可重建树并解码文件的其余部分。
跟进:
为了更好地回答原来的问题,无损压缩背后的想法与其说是删除空白,不如说是删除冗余信息。
如果您创建一个数据库来存储音乐歌词,您会发现大量空间被用来存储重复多次的合唱。您可以简单地将 CHORUS 一词放在合唱线的第一个实例之前,而不是使用所有空间,然后每次重复合唱时,只需使用 CHORUS 作为占位符(事实上,这几乎就是这个想法LZW 压缩之后 - 在 LZW 中,歌曲的每一行前面都会显示一个数字,如果一行在歌曲中稍后重复,则不要写出整行,只显示数字)。
A good place to start would be to lookup the Huffman compression scheme. The basic idea behind huffman is that in a given file some bytes appear more frequently then others (in a plaintext file many bytes won't appear at all). Rather then spend 8 bits to encode every byte, why not use a shorter bit sequence to encode the most common characters, and a longer sequences to encode the less common characters (these sequences are determined by creating a huffman tree).
Once you get a handle on using these trees to encode/decode files based on character frequency, imagine that you then start working on word frequency - instead of encoding "they" as a sequence of 4 characters, why not consider it to be a single character due to its frequency, allowing it to be assigned its own leaf in the huffman tree. This is more or less the basis of ZIP and other lossless type compression - they look for common "words" (sequences of bytes) in a file (including sequences of just 1 byte if common enough) and use a tree to encode them. The zip file then only needs to include the tree info (a copy of each sequence and the number of times it appears) to allow the tree to be reconstructed and the rest of the file to be decoded.
Follow up:
To better answer the original question, the idea behind lossless compression is not so much to remove empty space, but to remove redundent information.
If you created a database to store music lyrics, you'd find a lot of space was being used to store the chorus which repeats several times. Instead of using all that space, you could simply place the word CHORUS before the first instance of the chorus lines, and then every time the chorus is to be repeated, just use CHORUS as a place holder (in fact this is pretty much the idea behind LZW compression - in LZW each line of the song would have a number shown before it. If a line repeats later in the song, rather then write out the whole line only the number is shown)
基本概念是,不使用八位来表示每个字节,而是使用更短的表示形式来表示更频繁出现的字节或字节序列。
例如,如果您的文件仅包含重复十六次的字节 0x41 (
A
),则不要将其表示为 8 位序列01000001
,而是将其缩短为 1 -位序列0
。那么该文件可以用0000000000000000
(十六个0
)来表示。因此,重复十六次的字节0x41
的文件可以由重复两次的字节0x00
组成的文件来表示。因此,对于此文件(
0x41
重复十六次),位01000001
不会通过位0
传达任何其他信息>。因此,在这种情况下,我们丢弃无关的位以获得更短的表示。这就是压缩背后的核心思想。
作为另一个示例,考虑 8 字节模式
,现在重复它 2048 次。遵循上述方法的一种方法是使用三位来表示字节。
现在我们可以用
00000101 00111001 01110111
(这是三字节模式0x05 0x39 0x77
)重复 2048 次来表示上述字节模式。但更好的方法是
用单个位
0
来表示字节模式。然后我们可以用0
重复 2048 次来表示上面的字节模式,变成字节0x00
重复 256 次。现在我们只需要存储字典和字节模式
0x00
重复 256 次,我们将文件从 16,384 字节压缩到(以字典为模)256 字节。简而言之,这就是压缩的工作原理。整个工作归结为在给定文件中找到字节和字节序列的简短、有效的表示。这是一个简单的想法,但细节(找到表示)可能相当具有挑战性。
例如,请参见:
The basic concept is that instead of using eight bits to represent each byte, you use shorter representations for more frequently occuring bytes or sequences of bytes.
For example, if your file consists solely of the byte 0x41 (
A
) repeated sixteen times, then instead of representing it as the 8-bit sequence01000001
shorten it to the 1-bit sequence0
. Then the file can be represented by0000000000000000
(sixteen0
s). So then the file of the byte0x41
repeated sixteen times can be represented by the file consisting of the byte0x00
repeated twice.So what we have here is that for this file (
0x41
repeated sixteen times) the bits01000001
don't convey any additional information over the bit0
. So, in this case, we throw away the extraneous bits to obtain a shorter representation.That is the core idea behind compression.
As another example, consider the eight byte pattern
and now repeat it 2048 times. One way to follow the approach above is to represent bytes using three bits.
Now we can represent the above byte pattern by
00000101 00111001 01110111
(this is the three-byte pattern0x05 0x39 0x77
) repeated 2048 times.But an even better approach is to represent the byte pattern
by the single bit
0
. Then we can represent the above byte pattern by0
repeated 2048 times which becomes the byte0x00
repeated 256 times. Now we only need to store the dictionaryand the byte pattern
0x00
repeated 256 times and we compressed the file from 16,384 bytes to (modulo the dictionary) 256 bytes.That, in a nutshell is how compression works. The whole business comes down to finding short, efficient representations of the bytes and byte sequences in a given file. That's the simple idea, but the details (finding the representation) can be quite challenging.
See for example:
压缩之间的概念基本上是统计的。如果您有一系列字节,则实际上字节 N 为 X 的机会取决于前面字节 0..N-1 的值分布。如果不进行压缩,则为每个可能的值 X 分配 8 位。通过压缩,为每个值 X 分配的字节量取决于估计的机会 p(N,X)。
例如,给定序列“aaaa”,压缩算法可以向 p(5,a) 分配较高的值,向 p(5,b) 分配较低的值。当 p(X) 为高时,分配给 X 的位串将很短,当 p(X) 为低时,将使用长位串。这样,如果 p(N,X) 是一个好的估计,那么平均位串将短于 8 位。
The concept between compression is basically statististical. If you've got a series of bytes, the chance of byte N being X in practice depends on the value distribution of the previous bytes 0..N-1. Without compression, you allocate 8 bits for each possible value X. With compression, the amounts of bytes allocated for each value X depends on the estimated chance p(N,X).
For instance, given a sequence "aaaa", a compression algorithm can assign a high value to p(5,a) and lower values to p(5,b). When p(X) is high, the bitstring assigned to X will be short, when p(X) is low a long bitstring is used. In this way, if p(N,X) is a good estimate then the average bitstring will be shorter than 8 bits.