gzip compression anomaly?
Is there any way to predict what kind of compression result you'd get from using gzip on an arbitrary string? What factors contribute to the worst and best cases? I'm not sure how gzip works, but for example a string like:
"fffffff"
might compress well compared to something like:
"abcdefg"
Where do I start?
Thanks
gzip uses the DEFLATE algorithm, which, crudely described, compresses data by replacing repeated strings with pointers back to the first occurrence of each string. Thus, highly repetitive data compresses exceptionally well, while purely random data compresses very little, if at all.
By way of demonstration: my purely random data sample actually got larger due to overhead, while my file full of zeroes compressed to 0.1% of its previous size.
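You can reproduce that demonstration yourself. A minimal sketch using Python's standard-library `gzip` module (the exact sizes will vary slightly by zlib version, but the direction of the result will not):

```python
import gzip
import os

# Incompressible input: 100 KB of cryptographically random bytes.
random_data = os.urandom(100_000)
# Maximally repetitive input: 100 KB of zero bytes.
zero_data = bytes(100_000)

random_gz = gzip.compress(random_data)
zero_gz = gzip.compress(zero_data)

# Random data grows slightly: gzip adds header/trailer overhead and
# can't find any repeated strings to replace with back-references.
print(f"random: {len(random_data)} -> {len(random_gz)} bytes")
# Zeroes collapse to a tiny fraction of the original size.
print(f"zeroes: {len(zero_data)} -> {len(zero_gz)} bytes")
```

This mirrors the question's intuition: `"fffffff"` is the zero-file case in miniature, while a string with no repetition behaves like the random sample.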
The algorithm used by gzip is called DEFLATE.
It combines two popular compression techniques: Duplicate string elimination and bit reduction. Both are explained in the article.
Basically, as a rule of thumb, you could say that compression works best when some characters occur much more frequently than most others and/or when characters are often repeated consecutively. Compression works worst when characters are uniformly distributed across the input and change every time.
There are also ways to measure this, such as the entropy of the data.
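As a rough sketch of that rule of thumb, here is a small Python example (the `shannon_entropy` helper is my own illustrative function, not part of any library) that computes the Shannon entropy of two inputs and compares their gzip-compressed sizes:

```python
import gzip
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0 = fully predictable, 8 = maximum."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

samples = {
    "repetitive ('f' * 10000)": b"f" * 10_000,  # the question's best case
    "random (os.urandom)": os.urandom(10_000),  # the worst case
}

for name, data in samples.items():
    print(f"{name}: entropy={shannon_entropy(data):.2f} bits/byte, "
          f"gzip={len(gzip.compress(data))} of {len(data)} bytes")
```

The repetitive sample has an entropy of 0 bits per byte and shrinks to a few dozen bytes, while the random sample sits near the 8 bits-per-byte maximum and does not shrink at all. Note that byte-level entropy is only a rough predictor: DEFLATE also exploits repeated multi-byte strings, so data with high byte entropy can still compress well if longer patterns repeat.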