How do compression utilities add files sequentially into a compressed archive?
For example, when you tar -zcvf a directory, you can see a list of files being added sequentially to the final gzip file. But how does that happen?

Any compression algorithm, at the most basic level, uses the redundancy in the data to represent it more compactly and hence save space. But when file n is being added, a way of representing the first n - 1 files has already been chosen, and it might not be the optimal one, because until file n came along we never knew what the best way was.

Am I missing something? If not, does this mean that all these compression algorithms choose some sub-optimal representation of the data?
In gzip, the redundancy is restricted to a specific window size (by default 32k if I remember right). That means that after you process uncompressed data past that window, you can start writing compressed output.
You could call that "suboptimal", but the benefits provided, such as the ability to stream, and possibly error recovery (if there are synchronisation marks between windows; not sure how gzip works here), are worth it.
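To make the streaming point concrete, here is a minimal Python sketch (my own illustration, not gzip's actual code) using zlib's DEFLATE engine, the same algorithm family gzip uses. It emits compressed output while the input is still being fed in:

```python
import zlib

def stream_compress(chunks):
    """Compress an iterable of byte chunks, yielding output as it becomes
    available, without ever holding the whole input in memory."""
    comp = zlib.compressobj(6)        # default ~32 KiB history window
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:                       # compressed bytes may appear early,
            yield out                 # long before the input is finished
    yield comp.flush()                # emit whatever is still buffered

# Three "files" concatenated into one stream, roughly like tar | gzip
pieces = [b"aaaa" * 1000, b"abcd" * 1000, b"zzzz" * 1000]
compressed = b"".join(stream_compress(pieces))
print(len(compressed), "compressed bytes")
```

With a real tar -zcvf, tar plays the role of the chunk producer and gzip the role of the compressor, which is why compressed bytes can hit the disk long before the last file has been read.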
The short answer is that it doesn't -- gzip works incrementally, so the first part of a file generally is not compressed quite as much as later parts of the file.
The good point of this is that the compressed data itself contains what's necessary to build a "dictionary" to decompress the data, so you never have to explicitly transmit the dictionary with the data.
There are methods of compression (e.g., two-pass Huffman compression) where you scan through the data to find an ideal "dictionary" for that particular data, and then use it to compress the data. When you do this, however, you generally have to transmit the dictionary along with the data to be able to decompress it on the receiving end.
That can be a reasonable tradeoff -- if you have a reasonably high level of certainty that you'll be compressing enough data with the same dictionary, you might gain more from the improved compression than you lose by transmitting the dictionary. There is one problem though: the "character" of the data in a file often changes within the same file, so the dictionary that works best in one part of the file may not be very good at all for a different part of the file. This is particularly relevant for compressing a tar file that contains a number of constituent files, each of which may (and probably will) have differing redundancy.
The incremental/dynamic compression that gzip uses deals with that fairly well, because the dictionary it uses is automatically/constantly "adjusting" itself based on a window of the most recently-seen data. The primary disadvantage is that there's a bit of a "lag" built in, so right where the "character" of the data changes, the compression will temporarily drop until the dictionary has had a chance to "adjust" to the change.
A two-pass algorithm can improve compression for data that remains similar throughout the entire stream you're compressing. An incremental algorithm tends to do a better job of adjusting to more variable data.
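As a concrete illustration of the "ship the dictionary separately" tradeoff described above, here is a small Python sketch using zlib's preset-dictionary support. Note this is only an analogy: the gzip file format itself has no slot for a preset dictionary.

```python
import zlib

# A hand-picked "dictionary" of strings we expect the data to contain.
# In a real two-pass scheme it would be built from a first scan of the data.
dictionary = b'{"name": "", "value": 0, "timestamp": ""}'
data = b'{"name": "a", "value": 1, "timestamp": "2020-01-01"}' * 50

comp = zlib.compressobj(zdict=dictionary)
packed = comp.compress(data) + comp.flush()

# The receiver cannot decompress without being handed the same dictionary,
# which is exactly the extra cost discussed above.
decomp = zlib.decompressobj(zdict=dictionary)
assert decomp.decompress(packed) == data

print(len(data), "->", len(packed), "bytes (dictionary transmitted out of band)")
```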
数据格式,因此它无法重新排列内容以实现更好的压缩。When you say
tar -zcvf X
, that is equivalent to saying:So all
gzip
sees is bunch of bytes that it compresses,tar
andgzip
don't have a conversation about howtar
should order the files forgzip
to optimially compress the entire stream. Andgzip
doesn't know thetar
data format so it cannot rearrange things for better compression.
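Here is a rough sketch of the same division of labour using Python's standard library (the directory name some_dir is just a placeholder for illustration):

```python
import io
import gzip
import tarfile

# tar produces a plain byte stream; gzip compresses those bytes with no idea
# that they happen to be a tar archive. "some_dir" is a placeholder name.
tar_stream = io.BytesIO()
with tarfile.open(fileobj=tar_stream, mode="w") as tar:
    tar.add("some_dir")               # tar alone decides the file order

with gzip.open("some_dir.tar.gz", "wb") as gz:
    gz.write(tar_stream.getvalue())   # gzip just sees opaque bytes
```

A real tar -z streams the bytes through gzip rather than buffering the whole archive in memory first, but the separation of responsibilities is the same.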