How do compression utilities add files sequentially to a compressed archive?

Posted 2024-11-03 06:52:18

For example, when you tar -zcvf a directory, you can see a list of files being added sequentially to the final gzip file.

But how does that happen?

Any compression algorithm at the very basic level uses the redundancy in data to represent it in a better way and hence save space.

But when file n is being added, a way to represent the first n - 1 files has already been chosen, which might not be the optimal one, because until file n comes along we never know what the best representation would have been.

Am I missing something? If not, does this mean that all these compression algorithms choose some sub-optimal representation of data?
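The premise here, that compression exploits redundancy, is easy to see directly. A short illustrative sketch using Python's zlib module (which implements the same DEFLATE algorithm gzip uses): highly repetitive input shrinks dramatically, while random input does not.

```python
import os
import zlib

# DEFLATE (used by gzip/zlib) replaces repeated substrings with short
# back-references, so redundant data compresses far better than noise.
redundant = b"hello world! " * 1000   # 13,000 bytes of pure repetition
random_ish = os.urandom(13000)        # incompressible random bytes

print(len(zlib.compress(redundant)))   # a tiny fraction of the input size
print(len(zlib.compress(random_ish)))  # roughly the input size, or slightly more
```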


3 Answers

乖乖 2024-11-10 06:52:18

In gzip, the redundancy is restricted to a specific window size (by default 32k if I remember right). That means that after you process uncompressed data past that window, you can start writing compressed output.

You could call that "suboptimal", but the benefits provided, such as the ability to stream, and possibly error recovery (if there are synchronisation marks between windows; not sure how gzip works here), are worth it.
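The windowed, streaming behaviour described above can be sketched with Python's zlib module: `wbits=15` selects the 32 KiB (2^15-byte) history window, and a `compressobj` accepts input chunk by chunk, emitting compressed output as it goes rather than waiting for the whole stream.

```python
import zlib

# Back-references in DEFLATE can only reach 32 KiB behind the current
# position (wbits=15), so the compressor needs only that much history
# and can emit output incrementally as data streams through.
co = zlib.compressobj(level=6, wbits=15)

out = bytearray()
for chunk in (b"a" * 40000, b"b" * 40000):   # fed incrementally
    out += co.compress(chunk)                # may emit compressed bytes already
out += co.flush()                            # drain whatever remains buffered

print(len(out))  # far smaller than the 80,000 input bytes
```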

独守阴晴ぅ圆缺 2024-11-10 06:52:18

The short answer is that it doesn't -- gzip works incrementally, so the first part of a file generally is not compressed quite as much as later parts of the file.

The good point of this is that the compressed data itself contains what's necessary to build a "dictionary" to decompress the data, so you never have to explicitly transmit the dictionary with the data.

There are methods of compression (e.g., two-pass Huffman compression) where you scan through the data to find an ideal "dictionary" for that particular data, and then use it to compress the data. When you do this, however, you generally have to transmit the dictionary along with the data to be able to decompress it on the receiving end.

That can be a reasonable tradeoff -- if you have a reasonably high level of certainty that you'll be compressing enough data with the same dictionary, you might gain more from the improved compression than you lose by transmitting the dictionary. There is one problem though: the "character" of the data in a file often changes within the same file, so the dictionary that works best in one part of the file may not be very good at all for a different part of the file. This is particularly relevant for compressing a tar file that contains a number of constituent files, each of which may (and probably will) have differing redundancy.

The incremental/dynamic compression that gzip uses deals with that fairly well, because the dictionary it uses is automatically/constantly "adjusting" itself based on a window of the most recently-seen data. The primary disadvantage is that there's a bit of a "lag" built in, so right where the "character" of the data changes, the compression will temporarily drop until the dictionary has had a chance to "adjust" to the change.

A two-pass algorithm can improve compression for data that remains similar throughout the entire stream you're compressing. An incremental algorithm tends to do a better job of adjusting to more variable data.
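The dictionary tradeoff described above can be sketched with zlib's preset-dictionary feature (`zdict`). The dictionary below is hand-picked for illustration, standing in for what a first scan over the data might produce; note that the receiver must be given the identical dictionary out of band, which is exactly the transmission cost the answer describes.

```python
import zlib

# Illustrative "two-pass" idea: pick a dictionary of strings expected
# to recur in the data, then compress with it. The decompressor must
# use the very same dictionary.
sample = b'{"name": "alice", "role": "admin", "active": true}'
zdict = b'{"name": "role": "active": "admin" true false'  # hand-picked example

with_dict = zlib.compressobj(zdict=zdict)
plain = zlib.compressobj()

compressed_dict = with_dict.compress(sample) + with_dict.flush()
compressed_plain = plain.compress(sample) + plain.flush()
print(len(compressed_dict), len(compressed_plain))  # dictionary version is smaller here

# Decompression fails without the identical dictionary:
do = zlib.decompressobj(zdict=zdict)
assert do.decompress(compressed_dict) == sample
```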

幸福不弃 2024-11-10 06:52:18

When you say tar -zcvf X dir, that is equivalent to saying:

tar -cvf - dir | gzip > X

So all gzip sees is a bunch of bytes that it compresses; tar and gzip don't have a conversation about how tar should order the files for gzip to optimally compress the entire stream. And gzip doesn't know the tar data format, so it cannot rearrange things for better compression.
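This layering can be mirrored in Python with the standard tarfile and gzip modules: build the plain tar byte stream first, then gzip the whole thing. The `make_tar_gz` helper below is purely illustrative; the point is that gzip never sees file boundaries, only the concatenated tar bytes.

```python
import gzip
import io
import tarfile

def make_tar_gz(files):
    """Illustrative helper: files is a {name: bytes} mapping."""
    raw = io.BytesIO()
    # Stage 1: plain, uncompressed tar stream (what `tar -cf -` produces).
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    # Stage 2: gzip the whole byte stream (what the pipe into gzip does).
    return gzip.compress(raw.getvalue())

archive = make_tar_gz({"a.txt": b"hello", "b.txt": b"world"})

# Reading it back with "r:gz" shows the layering works end to end:
with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
    print(tar.getnames())  # the member names, in the order they were added
```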
