How would I create a file compressor similar to a ZIP archive in a language such as C/C++?
I was thinking about how a .zip archive is structured, and then I wondered: how could I create my own archive format?
Comments (2)
First, you would want to know what you are going to compress. For example, zip works well for many kinds of data, but not so well for audio files. FLAC works well for audio, but poorly on text files (provided you could find a way to apply it at all).
Once you had a compression scheme, you would write out the metadata needed to decompress the data later, followed by the compressed data itself.
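As a rough illustration of "metadata first, then compressed data", here is a minimal sketch of a made-up container layout. The magic string `MYAR`, the `Method` values, the field order, and the names `packEntry` and `unpackEntry` are all assumptions invented for this example; none of them come from ZIP or any other real format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout (not ZIP):
// "MYAR" | method (1 byte) | original size (4 bytes, little-endian)
// | payload size (4 bytes, little-endian) | payload bytes
enum class Method : std::uint8_t { Stored = 0, Huffman = 1 };

struct Entry {
    Method method;
    std::uint32_t originalSize;         // size of the data before compression
    std::vector<std::uint8_t> payload;  // compressed (or stored) bytes
};

static void putU32(std::vector<std::uint8_t>& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i) out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
}

static std::uint32_t getU32(const std::vector<std::uint8_t>& in, std::size_t pos) {
    assert(pos + 4 <= in.size());
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i) v |= static_cast<std::uint32_t>(in[pos + i]) << (8 * i);
    return v;
}

// Serializes the header fields followed by the payload.
std::vector<std::uint8_t> packEntry(const Entry& e) {
    std::vector<std::uint8_t> out;
    const char magic[4] = {'M', 'Y', 'A', 'R'};
    for (char c : magic) out.push_back(static_cast<std::uint8_t>(c));
    out.push_back(static_cast<std::uint8_t>(e.method));
    putU32(out, e.originalSize);
    putU32(out, static_cast<std::uint32_t>(e.payload.size()));
    out.insert(out.end(), e.payload.begin(), e.payload.end());
    return out;
}

// Parses what packEntry wrote; returns false if the bytes are not a valid entry.
bool unpackEntry(const std::vector<std::uint8_t>& in, Entry& e) {
    const std::size_t headerSize = 13;
    if (in.size() < headerSize) return false;
    if (in[0] != 'M' || in[1] != 'Y' || in[2] != 'A' || in[3] != 'R') return false;
    e.method = static_cast<Method>(in[4]);
    e.originalSize = getU32(in, 5);
    const std::uint32_t payloadSize = getU32(in, 9);
    if (in.size() - headerSize < payloadSize) return false;
    e.payload.assign(in.begin() + headerSize, in.begin() + headerSize + payloadSize);
    return true;
}
```

The point of the fixed-size header is that a reader can check the magic bytes, learn which method was used, and know how many bytes to read, all before it touches the compressed payload.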
Perhaps you would research a lossless compression method such as entropy encoding. You might decide that arithmetic coding gets closer to the entropy limit than Huffman coding, and implement an arithmetic codec. You might also look at dictionary encoding if you are more interested in compressing text.
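To make the idea of entropy coding a little more concrete, here is a minimal sketch that computes Huffman code lengths for the bytes of a buffer: frequent bytes end up with short codes and rare bytes with long ones. It only computes the lengths and does not write a bitstream. The function name `huffmanCodeLengths` and the tree representation are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Returns the Huffman code length (in bits) for each of the 256 byte values.
// A length of 0 means the byte does not occur in the input.
std::vector<int> huffmanCodeLengths(const std::string& data) {
    std::vector<std::size_t> freq(256, 0);
    for (unsigned char c : data) ++freq[c];

    // Tree nodes: leaves first, then internal nodes. parent[i] == -1 marks a root.
    std::vector<int> parent;
    std::vector<int> leafSymbol;               // byte value of each leaf node
    using Item = std::pair<std::size_t, int>;  // (weight, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;

    for (int s = 0; s < 256; ++s) {
        if (freq[s] == 0) continue;
        heap.push({freq[s], static_cast<int>(parent.size())});
        parent.push_back(-1);
        leafSymbol.push_back(s);
    }
    const std::size_t leafCount = leafSymbol.size();

    std::vector<int> lengths(256, 0);
    if (leafCount == 1) {  // a lone symbol still needs one bit
        lengths[leafSymbol[0]] = 1;
        return lengths;
    }

    // Repeatedly merge the two lightest nodes under a new parent.
    while (heap.size() > 1) {
        Item a = heap.top(); heap.pop();
        Item b = heap.top(); heap.pop();
        const int node = static_cast<int>(parent.size());
        parent.push_back(-1);
        parent[a.second] = node;
        parent[b.second] = node;
        heap.push({a.first + b.first, node});
    }

    // A leaf's code length is its depth in the finished tree.
    for (std::size_t i = 0; i < leafCount; ++i) {
        int depth = 0;
        for (int n = static_cast<int>(i); parent[n] != -1; n = parent[n]) ++depth;
        assert(depth > 0);  // with two or more leaves, no leaf is the root
        lengths[leafSymbol[i]] = depth;
    }
    return lengths;
}
```

For the input `"aaaabbc"` this gives `a` a 1-bit code and `b` and `c` 2-bit codes, so the seven bytes need 10 bits instead of 56, not counting the table that has to be stored with them.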
Edit in response to comment
One would have to store the entropy tables chosen when encoding the data alongside it, so that the data could be decoded later.
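One common way to keep such a stored table small, used by DEFLATE among others, is to store only the code length of each symbol and let both encoder and decoder rebuild the same "canonical" codes from those lengths. The following is a sketch of that reconstruction; the `Code` struct and the function name `canonicalCodes` are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Code {
    std::uint32_t bits = 0;  // code value, read most significant bit first
    int length = 0;          // number of bits; 0 means "symbol unused"
};

// Rebuilds canonical Huffman codes from per-symbol code lengths alone.
// Within one length, codes are handed out in increasing symbol order.
std::vector<Code> canonicalCodes(const std::vector<int>& lengths) {
    const int maxBits = 31;
    std::vector<std::uint32_t> countPerLength(maxBits + 1, 0);
    for (int len : lengths) {
        assert(len >= 0 && len <= maxBits);
        if (len > 0) ++countPerLength[len];
    }

    // Smallest code value for each length.
    std::vector<std::uint32_t> nextCode(maxBits + 1, 0);
    std::uint32_t code = 0;
    for (int len = 1; len <= maxBits; ++len) {
        code = (code + countPerLength[len - 1]) << 1;
        nextCode[len] = code;
    }

    // Assign consecutive code values to the symbols of each length.
    std::vector<Code> codes(lengths.size());
    for (std::size_t s = 0; s < lengths.size(); ++s) {
        if (lengths[s] == 0) continue;
        codes[s].bits = nextCode[lengths[s]]++;
        codes[s].length = lengths[s];
    }
    return codes;
}
```

With this scheme the file only has to carry one small integer per symbol, rather than the codes themselves or the tree that produced them.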
Take JPEG as an example. JPEG applies a color space transformation to YCbCr, a discrete cosine transform (DCT), and quantization, and then Huffman-codes the result. The color space metadata is included in the headers (how many bits per color sample and how many samples per channel, along with the size of the image). The quantization tables are included, together with an index of which table matches which channel, and so are the Huffman tables used to encode the DC and AC coefficients. The DCT and the zigzag coefficient order are fixed by the standard, so they do not need to be stored. When decoding, after undoing the Huffman coding you de-zigzag and de-quantize the coefficients and then apply the inverse DCT (IDCT).
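The de-zigzag and de-quantize steps for a single 8×8 block can be sketched as follows. This is a simplified illustration, not a JPEG decoder: it assumes the coefficients and the quantization table are both given in zigzag order, it generates the zigzag order instead of using a lookup table, and the names `zigzagOrder` and `dezigzagAndDequantize` are made up for this example.

```cpp
#include <array>
#include <cassert>

// Returns, for each zigzag position k, the row-major index (row * 8 + col)
// of the coefficient stored there. The scan walks the 15 anti-diagonals of
// the block, alternating direction on each one.
std::array<int, 64> zigzagOrder() {
    std::array<int, 64> order{};
    int k = 0;
    for (int d = 0; d < 15; ++d) {
        const int lo = d < 8 ? 0 : d - 7;  // smallest row on this diagonal
        const int hi = d < 8 ? d : 7;      // largest row on this diagonal
        if (d % 2 == 1) {
            // Odd diagonals run from top-right to bottom-left.
            for (int r = lo; r <= hi; ++r) order[k++] = r * 8 + (d - r);
        } else {
            // Even diagonals run from bottom-left to top-right.
            for (int r = hi; r >= lo; --r) order[k++] = r * 8 + (d - r);
        }
    }
    assert(k == 64);
    return order;
}

// coeffs and quant are both in zigzag order; the result is in row-major order,
// ready for the inverse DCT.
std::array<int, 64> dezigzagAndDequantize(const std::array<int, 64>& coeffs,
                                          const std::array<int, 64>& quant) {
    const std::array<int, 64> order = zigzagOrder();
    std::array<int, 64> block{};
    for (int k = 0; k < 64; ++k) block[order[k]] = coeffs[k] * quant[k];
    return block;
}
```

Because the scan order is fixed, the decoder needs no extra metadata for this step; only the quantization table has to travel with the file.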
size and color.
You would have to define your own standard: figure out the minimum information needed to recover the data, and store it so that it can be read without knowing the details of what is inside.
I don't know the details of .zip, but I would imagine it has a couple of dictionary tables and a couple of entropy tables. You would entropy-decode the data segment (whose boundaries must somehow be determined by the standard or by a marker), then apply the reverse dictionary substitution.
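For what it's worth, DEFLATE, the method most .zip files use, pairs LZ77-style back-references with Huffman coding, and in that scheme the "dictionary" is simply the output decoded so far. Below is a minimal sketch of the reverse substitution step, assuming the entropy decoding has already produced a list of tokens. The `Token` struct and the name `lz77Decode` are hypothetical simplifications for this example and do not reflect the actual DEFLATE bitstream.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type: either a literal byte, or a copy of `length` bytes
// starting `distance` bytes back in the output produced so far.
struct Token {
    bool isLiteral;
    char literal;
    std::size_t distance;
    std::size_t length;
};

std::string lz77Decode(const std::vector<Token>& tokens) {
    std::string out;
    for (const Token& t : tokens) {
        if (t.isLiteral) {
            out.push_back(t.literal);
            continue;
        }
        assert(t.distance >= 1 && t.distance <= out.size());
        const std::size_t from = out.size() - t.distance;
        // Copy byte by byte: the source may overlap the bytes being written,
        // which is how a short pattern gets repeated many times.
        for (std::size_t i = 0; i < t.length; ++i) out.push_back(out[from + i]);
    }
    return out;
}
```

For example, the tokens literal `a`, literal `b`, copy (distance 2, length 4) decode to `ababab`.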
Download the bzip2 sources and compile them, then go from there.