How does the ability to compress a stream affect a compression algorithm?
I recently backed up my soon-to-expire university home directory by sending it as a tar stream and compressing it on my end: ssh user@host "tar cf - my_dir/" | bzip2 > uni_backup.tar.bz2
This got me thinking: I only know the basics of how compression works, but I would imagine that this ability to compress a stream of data would lead to poorer compression, since the algorithm has to finish handling a block of data at some point, write it to the output stream, and move on to the next block.
Is this the case? Or do these programs simply read a lot of data into memory, compress it, write it, and then repeat? Or are there clever tricks used in these "stream compressors"? I see that both bzip2's and xz's man pages talk about memory usage, and man bzip2 also hints at the fact that little is lost by chopping the data to be compressed into blocks:
Larger block sizes give rapidly diminishing marginal returns. Most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using bzip2 on small machines. It is also important to appreciate that the decompression memory requirement is set at compression time by the choice of block size.
I would still love to hear whether other tricks are used, or where I can read more about this.
This question relates more to buffer handling than to compression algorithms, although a bit could be said about those too.
Some compression algorithms are inherently "block based", which means they absolutely need to work with blocks of a specific size. This is the case for bzip2, whose block size is selected with the "level" switch, from 100 kB to 900 kB.
So, if you stream data into it, it will wait for the block to be filled, and start compressing the block once it is full (or, for the last block, it will work with whatever size it receives).
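You can see this buffering behaviour through bzip2's incremental interface, for example via Python's bz2 module. This is only a minimal sketch with made-up sample data; the exact points at which output appears depend on the library, but compress() typically returns nothing until an internal block has been filled:

    import bz2

    comp = bz2.BZ2Compressor(1)  # level 1: roughly 100 kB blocks
    chunk = b"fairly repetitive sample data\n" * 1000  # ~30 kB per call

    for i in range(10):
        out = comp.compress(chunk)       # usually b'' until an internal block fills
        print(i, len(out))
    print("flush:", len(comp.flush()))   # the final, possibly partial, block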
Some other compression algorithms can handle streams, which means they can continuously compress new data using older data kept in a memory buffer. Algorithms based on a "sliding window" can do this, and zlib typically achieves it.
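For instance, zlib's incremental interface (shown here through Python's zlib module; a small sketch with made-up data, using standard zlib flush modes) can emit output as data arrives, because back-references only ever point into the last 32 kB of input:

    import zlib

    comp = zlib.compressobj(level=6, wbits=15)   # wbits=15: 32 kB sliding window

    # Output can be produced as data arrives; matches may reference the
    # previous 32 kB of input, with no fixed block boundary required.
    out = comp.compress(b"first part of the stream " * 500)
    out += comp.flush(zlib.Z_SYNC_FLUSH)         # push out whatever is buffered
    out += comp.compress(b"second part, reusing earlier matches " * 500)
    out += comp.flush(zlib.Z_FINISH)             # finish the stream

    print(len(out))
    print(len(zlib.decompress(out)))             # still one valid zlib stream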
Now, even "sliding window" compressors may nonetheless choose to cut the input data into blocks, either for easier buffer management or to enable multi-threading, as pigz does.
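The idea behind that kind of chunking can be sketched roughly as follows: cut the input into fixed-size chunks, compress each one independently, and concatenate the results, which standard gzip decompressors accept as a single stream. This is only an illustration of the principle, not pigz itself: pigz actually primes each block with the tail of the previous one so cross-chunk matches are not lost, and the chunk size and helper names below are chosen purely for illustration.

    import gzip
    from concurrent.futures import ProcessPoolExecutor

    CHUNK = 128 * 1024  # illustrative chunk size

    def compress_chunk(data: bytes) -> bytes:
        # Each chunk becomes an independent gzip member; concatenated
        # members still form a valid gzip stream.
        return gzip.compress(data)

    def parallel_gzip(payload: bytes) -> bytes:
        chunks = [payload[i:i + CHUNK] for i in range(0, len(payload), CHUNK)]
        with ProcessPoolExecutor() as pool:
            return b"".join(pool.map(compress_chunk, chunks))

    if __name__ == "__main__":
        data = b"example data to compress in parallel\n" * 100_000
        out = parallel_gzip(data)
        assert gzip.decompress(out) == data   # decompresses as one stream
        print(len(data), "->", len(out))

The price of compressing chunks independently is slightly worse compression, since matches cannot cross chunk boundaries, which is exactly the trade-off the question asks about.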