Sending a highly compressed text file over the network

Posted 2024-08-05 02:34:12

I have a text file that I want to send over the network. Its size can range from 1 KB to 500 KB.
What algorithms or techniques could I use to compress this file as tightly as possible before sending it, so that the fewest bytes go over the network?

Comments (3)

妄司 2024-08-12 02:34:12

For compression, I'd consider gzip, bzip2 and LZMA (this is not an exhaustive list, but these are IMO the best known).

Then, I'd look for some benchmarks on the net and try to gather metrics for various file types (text, binary, mixed) and sizes (small, big, huge). Even if you're mostly interested in the compression ratio, you might want to look at: the compression ratio, the compression time, the memory footprint, the decompression time.

According to A Quick Benchmark: Gzip vs. Bzip2 vs. LZMA:

[...] gzip is very fast and has small memory footprint. According to this benchmark, neither bzip2 nor lzma can compete with gzip in terms of speed or memory usage. bzip2 has notably better compression ratio than gzip, which has to be the reason for the popularity of bzip2; it is slower than gzip especially in decompression and uses more memory. However the memory requirements of bzip2 should be nowadays no problem even on older hardware.

[...]

LZMA clearly has potential to become the third commonly used general purpose compression format on *NIX systems. It mainly competes with bzip2 by offering significantly better compression ratio while still keeping decompressing speed relatively close to that of gzip.

This is confirmed in LZMA - better than bzip2:

The description is impressive, in short:

  • Better compression ratio (at the best compression level, gzip achieves 38%, bzip2 34%, and LZMA 25%).
  • The compression ratio gain is seen mainly on binary files.
  • Decompression time is much faster (3-4 times) than bzip2.
  • The algorithm can be executed in parallel (but the tool I'll describe here is single-threaded).

There are also disadvantages:

  • Compression (excluding lower levels) is much slower than bzip2.
  • Memory requirements are much bigger during compression than bzip2.

So, for the compression of text files, the same site reports:

First thing I used LZMA for was compressing my mail archive. The spam file (mail in mbox format) I chose is 528MB big and I will use maximum compression ratio. During compression the lzma process was 370MB big, that's much :) bzip2 was below 7MB. It took almost 15 minutes to compress the file by lzma and less than 4 minutes by bzip2. Compression ratio was very similar: output file is 373MB for bzip2 and 370MB for lzma. Decompression time is 1m12s for lzma and 1m48s for bzip2.

Finally, here is another resource with graphical results: Compression Tools: lzma, bzip2 & gzip

I'd really recommend running your own benchmark (as you'll be compressing text only, and only very small to small files) to get real metrics in your environment. My bet is that LZMA won't provide a significant advantage on small text files, so bzip2 would be a decent choice (even if the time and memory overhead of LZMA might be low on small files).
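
As a starting point for such a benchmark, here is a minimal sketch in Java. It only exercises the JDK's built-in Deflate (the algorithm behind gzip) at three levels, because bzip2 and LZMA need third-party libraries; those would have to be plugged in the same way. The class name `DeflateBench`, the helper names and the generated sample text are all hypothetical, so substitute your real files before drawing conclusions.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

final class DeflateBench {
    /** Returns how many bytes Deflate produces for raw at the given level (0-9). */
    static int deflatedSize(byte[] raw, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(raw);
            deflater.finish();              // signal that no more input follows
            byte[] chunk = new byte[8192];  // output is discarded; only its length matters
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(chunk);
            }
            return total;
        } finally {
            deflater.end();                 // release the native zlib resources
        }
    }

    /** Hypothetical stand-in for a real payload: replace with your own files. */
    static byte[] sampleText(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2024-08-05 INFO request ").append(i)
              .append(" handled in ").append(i % 97).append(" ms\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleText(5000);
        int[] levels = {Deflater.BEST_SPEED, 6, Deflater.BEST_COMPRESSION};
        for (int level : levels) {
            long start = System.nanoTime();
            int size = deflatedSize(raw, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level=%d size=%d (%.1f%% of original) time=%d us%n",
                    level, size, 100.0 * size / raw.length, micros);
        }
    }
}
```

A single timed run like this is noisy (JIT warm-up alone can dominate on small inputs), so repeat each measurement several times and look at the trend rather than at one number.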

If you plan to perform the compression from Java, you'll find an LZMA implementation here and a bzip2 implementation here (coming from Apache Ant, AFAIK); gzip is included in the JDK. If you don't want to or can't rely on a third-party library, use gzip.
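
For the JDK-only route, a gzip round trip might look like the sketch below. It uses `java.util.zip.GZIPOutputStream` and `GZIPInputStream` from the standard library; the class name `GzipText` and its two helpers are hypothetical names chosen for illustration, and the whole payload is held in memory, which seems reasonable for files of up to 500 KB.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipText {
    /** Compresses raw bytes into the gzip format at the JDK's default level. */
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so the gzip trailer has been written.
        return buffer.toByteArray();
    }

    /** Restores the original bytes from gzip data produced by compress. */
    static byte[] decompress(byte[] compressed) {
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("line ").append(i).append(": some repetitive text\n");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        System.out.printf("raw=%d bytes, gzip=%d bytes, round trip ok=%b%n",
                raw.length, packed.length, Arrays.equals(raw, decompress(packed)));
    }
}
```

The sender would write the result of `compress` to the socket and the receiver would pass what it reads to `decompress`; wrapping the socket streams directly in the gzip streams is an alternative that avoids the intermediate byte arrays.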

笑,眼淚并存 2024-08-12 02:34:12

The answer depends on the content. GZip is included in the JDK. Tests on random strings seem to average a 33% reduction in size.

[edit: content, not context]

梦屿孤独相伴 2024-08-12 02:34:12

It depends. Can you control the network packet size? Are you going to bundle files together if more than one will fit in a packet? Are you limited by CPU on either end? That's not really the question, but it's still related, since at times it can take longer to compress and decompress than to send the bytes.
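
To make the packet-size point concrete, here is a rough sketch that counts how many packets a payload needs. It assumes 1460 payload bytes per packet (a 1500-byte Ethernet MTU minus 20-byte IPv4 and 20-byte TCP headers, no options); that figure, the class name `PacketEstimate` and the example sizes are assumptions for illustration, not measurements from your network.

```java
final class PacketEstimate {
    // Assumption: 1500-byte MTU minus 20-byte IPv4 and 20-byte TCP headers.
    static final int ASSUMED_PAYLOAD_BYTES = 1460;

    /** Number of packets needed to carry the given number of bytes (rounded up). */
    static long packetsFor(long bytes) {
        return (bytes + ASSUMED_PAYLOAD_BYTES - 1) / ASSUMED_PAYLOAD_BYTES;
    }

    public static void main(String[] args) {
        // A 1 KB file and a hypothetical 400-byte compressed version of it.
        System.out.println("1 KB raw:        " + packetsFor(1024) + " packet(s)");
        System.out.println("400 B compressed: " + packetsFor(400) + " packet(s)");
        // A 500 KB file, where any size reduction removes whole packets.
        System.out.println("500 KB raw:      " + packetsFor(500 * 1024) + " packet(s)");
    }
}
```

Under this assumption the smallest files already fit in a single packet whether or not they are compressed, so the CPU time spent compressing them buys little; the savings show up at the larger end of the size range.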
