gzip compression anomalies?

Posted 2024-09-16 18:31:46


Is there any way to predict what kind of compression result you'd get using gzip on an arbitrary string? What factors contribute to the worst and best cases? I'm not sure how gzip works, but for example a string like:

"fffffff"

might compress well compared to something like:

"abcdefg"

Where do I start?

Thanks


Comments (2)

攒眉千度 2024-09-23 18:31:46


gzip uses the deflate algorithm, which, crudely described, compresses files by replacing repeated strings with back-references (pointers) to an earlier occurrence within a sliding window. Thus, highly repetitive data compresses exceptionally well, while purely random data will compress very little, if at all.

By means of demonstration:

[chris@polaris ~]$ dd if=/dev/urandom of=random bs=1048576 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.296325 s, 3.5 MB/s
[chris@polaris ~]$ ll random
-rw-rw-r-- 1 chris chris 1048576 2010-08-30 16:12 random
[chris@polaris ~]$ gzip random
[chris@polaris ~]$ ll random.gz
-rw-rw-r-- 1 chris chris 1048761 2010-08-30 16:12 random.gz

[chris@polaris ~]$ dd if=/dev/zero of=ordered bs=1048576 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00476905 s, 220 MB/s
[chris@polaris ~]$ ll ordered
-rw-rw-r-- 1 chris chris 1048576 2010-08-30 16:12 ordered
[chris@polaris ~]$ gzip ordered
[chris@polaris ~]$ ll ordered.gz
-rw-rw-r-- 1 chris chris 1059 2010-08-30 16:12 ordered.gz

My purely random data sample actually got larger due to overhead, while my file full of zeroes compressed to 0.1% of its previous size.
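The same effect can be reproduced in a few lines of Python with the standard-library `gzip` module (a small sketch of my own, not part of the original answer):

```python
import gzip
import os

# A highly repetitive input compresses dramatically; incompressible random
# bytes actually grow slightly, because gzip adds a fixed header and trailer
# (and deflate falls back to "stored" blocks when it finds nothing to match).
repetitive = b"f" * 1000
random_data = os.urandom(1000)

print(len(gzip.compress(repetitive)))   # a few dozen bytes
print(len(gzip.compress(random_data)))  # slightly more than 1000 bytes
```

This mirrors the `dd`/`gzip` demonstration above at string scale: the ratio depends almost entirely on how much repetition deflate can find.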

旧时浪漫 2024-09-23 18:31:46


The algorithm used by gzip is called DEFLATE.

It combines two popular compression techniques: duplicate-string elimination (LZ77) and bit reduction (Huffman coding).

Basically, as a rule of thumb, compression works best when some characters occur much more often than others and/or when characters repeat consecutively. It works worst when characters are uniformly distributed in the input and change every time.

There are also measurements for this, like the entropy of the data.
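As a rough sketch (the function name is my own), Shannon entropy can be computed with the standard library; it gives a first-order estimate in bits per byte, although it deliberately ignores the positional repetition that LZ77 also exploits:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    # Bits per byte: 0.0 for a constant input, up to 8.0 for uniform random bytes.
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

print(shannon_entropy(b"fffffff"))  # 0.0 -- a single symbol, maximally compressible
print(shannon_entropy(b"abcdefg"))  # log2(7), about 2.81 -- seven distinct symbols
```

Low entropy means the Huffman stage alone can shrink the data; high entropy plus no long-range repetition is the worst case described above.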
