If file sizes could be specified accurate to the bit, for any file size N, there would be precisely 2^(N+1)-1 possible files of N bits or smaller. In order for a file of size X to be mapped to some smaller size Y, some file of size Y or smaller must be mapped to a file of size X or larger. The only way lossless compression can work is if some possible files can be identified as being more probable than others; in that scenario, the likely files will be shrunk and the unlikely ones will grow.
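For reference, the count of files of N bits or smaller follows from summing over every possible length from zero to N bits:

```latex
% Number of distinct bit strings of length at most N (including the empty file)
\sum_{k=0}^{N} 2^k = 2^{N+1} - 1
```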
As a simple example, suppose that one wishes to store losslessly a file in which the bits are random and independent, but instead of 50% of the bits being set, only 33% are. One could compress such a file by taking each pair of bits and writing "0" if both bits were clear, "10" if the first bit was set and the second was not, "110" if the second was set and the first was not, or "111" if both bits were set. The effect would be that each pair of bits would become one bit about 44% of the time, two bits about 22% of the time, and three bits about 33% of the time. While some strings of data would grow, others would shrink; if the probability distribution were as expected, the pairs that shrank would outnumber those that grew (4/9 of the pairs would shrink by one bit, 2/9 would stay the same, and 3/9 would grow by one bit, so pairs would on average shrink by 1/9 of a bit, and files would on average shrink by 1/18 of a bit per input bit, since the 1/9 figure is per two-bit pair).
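A minimal sketch of this pair encoding in Python (the `encode`/`decode` helpers and the simulation are just for illustration, not part of any standard tool):

```python
import random

# Map each pair of bits to a variable-length codeword, as described above:
# 00 -> "0", 10 -> "10", 01 -> "110", 11 -> "111"  (a prefix-free code)
PAIR_CODES = {(0, 0): "0", (1, 0): "10", (0, 1): "110", (1, 1): "111"}

def encode(bits):
    """Encode a sequence of bits (assumed to have even length)."""
    return "".join(PAIR_CODES[(bits[i], bits[i + 1])]
                   for i in range(0, len(bits), 2))

def decode(code):
    """Invert the encoding by reading prefix-free codewords greedily."""
    inverse = {v: k for k, v in PAIR_CODES.items()}
    bits, buf = [], ""
    for ch in code:
        buf += ch
        if buf in inverse:
            bits.extend(inverse[buf])
            buf = ""
    return bits

# Bits set with probability 1/3, as in the example above.
bits = [1 if random.random() < 1 / 3 else 0 for _ in range(100000)]
code = encode(bits)
assert decode(code) == bits
print(len(code) / len(bits))  # roughly 17/18, i.e. about 0.944 of the original size
```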
Note that if the bits actually had a 50% distribution, then only 25% of the pairs would become one bit, 25% would stay at two bits, and 50% would become three bits. Consequently, 25% of the pairs would shrink and 50% would grow, so pairs would on average grow by a quarter of a bit, and files would grow by 12.5%. The break-even point would be about 38.2% of bits being set (two minus the golden ratio, i.e. (3-√5)/2), which would yield 38.2% of the bit pairs shrinking and the same percentage growing.
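To see where that break-even figure comes from, one can write the expected output length per input pair as a function of p, the probability that a bit is set (a quick sanity check, assuming the four-codeword scheme above):

```latex
% Expected output bits per two-bit input pair, where p is the probability a bit is set
E(p) = (1-p)^2 \cdot 1 + p(1-p) \cdot 2 + (1-p)p \cdot 3 + p^2 \cdot 3 = 1 + 3p - p^2
% Break-even when E(p) = 2:
p^2 - 3p + 1 = 0 \quad\Rightarrow\quad p = \frac{3 - \sqrt{5}}{2} \approx 0.382
```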
There is no one universally best compression algorithm. Different algorithms have been invented to handle different data.
For example, JPEG compression can shrink images a great deal because it usually does not matter much whether a red value in the image is 0xFF or 0xFE. However, if you applied that kind of lossy change to a text document, the results would be disastrous.
Also, even between two compression algorithms designed to work with the same kind of data, your results will vary depending on your data.
Example: sometimes a gzip-compressed tarball is smaller, and sometimes a bzip2-compressed tarball is smaller.
Lastly, for truly random data of sufficient length, the compressed output will likely be almost the same size as (or even larger than) the original data.
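As a rough illustration of both points (exact sizes depend on the input and on library versions), Python's standard-library gzip and bz2 modules can be compared on repetitive versus random input:

```python
import bz2
import gzip
import os

text = b"the quick brown fox jumps over the lazy dog\n" * 1000  # highly repetitive
noise = os.urandom(len(text))                                   # effectively incompressible

for label, data in (("text", text), ("random", noise)):
    print(label,
          "original:", len(data),
          "gzip:", len(gzip.compress(data)),
          "bz2:", len(bz2.compress(data)))

# The repetitive text shrinks dramatically (and the gzip vs. bz2 ranking can flip
# on other inputs), while the random bytes stay about the same size or grow
# slightly because of container overhead.
```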