gzip compression anomaly?
Is there any way to predict what kind of compression result you'd get from using gzip on an arbitrary string? What factors contribute to the worst and best cases? I'm not sure how gzip works, but for example a string like:
"fffffff"
might compress well compared to something like:
"abcdefg"
Where do I start?
Thanks
gzip uses the DEFLATE algorithm, which, crudely described, compresses data by replacing repeated strings with pointers back to the first occurrence of each string. Thus, highly repetitive data compresses exceptionally well, while purely random data compresses very little, if at all.
By way of demonstration: my purely random data sample actually got larger due to overhead, while my file full of zeroes compressed to 0.1% of its previous size.
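You can reproduce that demonstration yourself. A minimal sketch using Python's standard-library `gzip` module (the exact sizes will vary slightly by zlib version, but the direction of the result will not):

```python
import gzip
import os

# Incompressible input: 100 KB of cryptographically random bytes.
random_data = os.urandom(100_000)
# Maximally repetitive input: 100 KB of zero bytes.
zero_data = bytes(100_000)

random_gz = gzip.compress(random_data)
zero_gz = gzip.compress(zero_data)

# Random data grows slightly: gzip adds header/trailer overhead and
# can't find any repeated strings to replace with back-references.
print(f"random: {len(random_data)} -> {len(random_gz)} bytes")
# Zeroes collapse to a tiny fraction of the original size.
print(f"zeroes: {len(zero_data)} -> {len(zero_gz)} bytes")
```

This mirrors the question's intuition: `"fffffff"` is the zero-file case in miniature, while a string with no repetition behaves like the random sample.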
The algorithm used by gzip is called DEFLATE.
It combines two popular compression techniques: Duplicate string elimination and bit reduction. Both are explained in the article.
Basically, as a rule of thumb, you could say that compression works best when some characters occur much more frequently than most others and/or when characters are often repeated consecutively. Compression works worst when characters are uniformly distributed across the input and change every time.
There are also ways to measure this, such as the entropy of the data.
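As a rough sketch of that rule of thumb, here is a small Python example (the `shannon_entropy` helper is my own illustrative function, not part of any library) that computes the Shannon entropy of two inputs and compares their gzip-compressed sizes:

```python
import gzip
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0 = fully predictable, 8 = maximum."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

samples = {
    "repetitive ('f' * 10000)": b"f" * 10_000,  # the question's best case
    "random (os.urandom)": os.urandom(10_000),  # the worst case
}

for name, data in samples.items():
    print(f"{name}: entropy={shannon_entropy(data):.2f} bits/byte, "
          f"gzip={len(gzip.compress(data))} of {len(data)} bytes")
```

The repetitive sample has an entropy of 0 bits per byte and shrinks to a few dozen bytes, while the random sample sits near the 8 bits-per-byte maximum and does not shrink at all. Note that byte-level entropy is only a rough predictor: DEFLATE also exploits repeated multi-byte strings, so data with high byte entropy can still compress well if longer patterns repeat.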