Which compression/archive formats support inter-file compression?
This question on archiving PDFs got me wondering -- if I wanted to compress (for archival purposes) lots of files which are essentially small changes made on top of a master template (a letterhead), it seems like huge compression gains could be had with inter-file compression.
Do any of the standard compression/archiving formats support this? AFAIK, all the popular formats focus on compressing each file individually.
3 Answers
Several formats do inter-file compression.
The oldest example is .tar.gz: a .tar applies no compression but concatenates all the files, with a header before each one, and .gz can only compress a single file. The two are applied in sequence, and the combination is the traditional format in the Unix world. .tar.bz2 is the same, only with bzip2 instead of gzip.
More recent examples are formats with optional "solid" compression (for instance, RAR and 7-Zip), which can internally concatenate all the files before compressing, if enabled by a command-line flag or GUI option.
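To make the difference concrete, here is a minimal sketch (Python, with fabricated file contents) comparing zip's per-file compression against solid tar.gz compression. The 8 KB template is deliberately kept well under gzip's ~32 KB window -- see the next answer for why that matters.

```python
# Sketch: per-file compression (zip) vs. solid compression (tar.gz).
# The inputs are fabricated: a shared 8 KB "letterhead" plus tiny per-file edits.
import io
import os
import tarfile
import zipfile

template = os.urandom(8 * 1024)  # incompressible on its own, shared by every file
files = {f"letter_{i}.txt": template + f"Dear recipient {i}\n".encode() for i in range(20)}

# zip compresses each member independently, so the shared template is paid for 20 times.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)

# tar concatenates first, then gzip compresses the whole stream, so files 2..20
# become cheap back-references to file 1.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

print("zip, per-file: ", len(zip_buf.getvalue()), "bytes")  # roughly 20 x 8 KB
print("tar.gz, solid: ", len(tar_buf.getvalue()), "bytes")  # roughly 1 x 8 KB
```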
Take a look at Google's open-vcdiff.
http://code.google.com/p/open-vcdiff/
It is designed for calculating small compressed deltas and implements RFC 3284.
http://www.ietf.org/rfc/rfc3284.txt
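open-vcdiff itself is a C++ library, so as a rough stand-in for the delta idea, here is a hedged Python sketch using zlib's preset-dictionary support (the zdict parameter, Python 3.3+): each document is compressed as a delta against the shared template. The inputs are fabricated, and the template must fit within DEFLATE's ~32 KB window; vcdiff has no such limit.

```python
# Sketch of delta compression against a master template, using zlib's preset
# dictionary as a stand-in for vcdiff. Inputs are fabricated for illustration.
import os
import zlib

template = os.urandom(16 * 1024)            # the shared "letterhead"
document = template + b"Dear Alice, please find attached..."

# Compress the document with the template preloaded as the dictionary.
comp = zlib.compressobj(zdict=template)
delta = comp.compress(document) + comp.flush()

# Decompression requires the same dictionary to be on hand.
decomp = zlib.decompressobj(zdict=template)
assert decomp.decompress(delta) + decomp.flush() == document

print(len(document), "bytes ->", len(delta), "byte delta")
```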
Microsoft has an API for doing something similar, sans any semblance of a standard.
In general, the algorithms you are looking for are based on Bentley/McIlroy:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8470
In particular, these algorithms will be a win if the size of the template is larger than the window size (~32 KB) used by gzip or the block size (100-900 KB) used by bzip2; the sketch at the end of this answer demonstrates the window limit.
Google uses them internally in its Bigtable implementation to store compressed web pages, for much the same reason you are seeking them.
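A quick, hedged demonstration of the window-size point, using Python's zlib (the same DEFLATE that gzip uses): a duplicate block that starts within the ~32 KB window compresses to almost nothing, while one that sits farther back is treated as brand-new data.

```python
# Sketch: DEFLATE back-references cannot reach past the ~32 KB window.
import os
import zlib

for size_kb in (16, 64):
    block = os.urandom(size_kb * 1024)
    single = len(zlib.compress(block, 9))
    doubled = len(zlib.compress(block + block, 9))
    print(f"{size_kb} KB block: single ~{single} bytes, doubled ~{doubled} bytes")

# 16 KB: the second copy starts inside the window, so "doubled" is about the
# same size as "single". 64 KB: the second copy is out of reach, so "doubled"
# is about twice "single".
```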
Since the dictionary-based compression these formats use (gzip's DEFLATE, for instance) builds its table of repeated strings as it goes along, a scheme such as you desire would limit you to having to decompress the entire archive at once.
If that is acceptable in your situation, it may be simpler to implement a method which just joins your files into one big file before compression, as in the sketch below.
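As a hedged illustration of that join-then-compress approach, here is a minimal Python sketch that length-prefixes each file so the stream can be split apart again. The framing is improvised, not any standard format, and extraction is all-or-nothing, as noted above.

```python
# Sketch: concatenate files with simple length-prefix framing, then gzip the lot.
import gzip
import struct

def pack(paths, archive_path):
    """Write name/data records back to back inside one gzip stream."""
    with gzip.open(archive_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as f:
                data = f.read()
            name = path.encode("utf-8")
            out.write(struct.pack(">I", len(name)) + name)
            out.write(struct.pack(">Q", len(data)) + data)

def unpack(archive_path):
    """Yield (name, data) pairs; the whole stream is decompressed on the way through."""
    with gzip.open(archive_path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            name = f.read(struct.unpack(">I", header)[0]).decode("utf-8")
            (size,) = struct.unpack(">Q", f.read(8))
            yield name, f.read(size)
```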