当前位置：文江博客话题详情

Java Defender 策略 - DEFAULT_STRATEGY、FILTERED 和 HUFFMAN_ONLY

发布于 2024-08-26 22:38:39 字数 340 浏览 6 评论 0原文

在 gzip 压缩 Java Web 应用程序响应时，我试图在性能和压缩程度之间找到平衡。

在查看 Deflater 类时，我可以设定一个级别和一个策略。这些级别是不言自明的 - BEST_SPEED 到 BEST_COMPRESSION。

我不确定这些策略 - DEFAULT_STRATEGY、FILTERED 和 HUFFMAN_ONLY

我可以从 Javadoc 中理解一些道理，但我想知道是否有人有在他们的应用程序中使用了特定的策略，以及您是否发现性能/压缩程度方面有任何差异。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

誰ツ都不明白 2024-09-02 22:38:39

Java Deflater 中提到的策略选项起源于 ZLIB 的 zlib (C) 实现和 (RFC 1950 ）和 DEFLATE（1951），我相信。它们几乎存在于所有实现 DEFLATE 的压缩库中。

要理解它们的含义，您需要了解一些关于 DEFLATE 的信息。压缩算法结合了LZ77和Huffman编码。基本原理是：

LZ77 压缩的工作原理是查找重复的数据序列。实现通常使用 1k 到 32k 之间的“滑动窗口”来跟踪之前的数据。对于原始数据中的任何重复，LZ77 压缩不会在输出中插入重复数据，而是会插入“反向引用”。想象一下反向引用说“在这里，插入您在 8293 字节前看到的相同数据，17 字节”。 back-ref 被编码为这对数字：长度 - 在本例中为 17 - 和距离（或偏移量） - 在本例中为 8293。
霍夫曼编码用代码替换实际数据。当数据显示 X 时，霍夫曼代码显示 Y。显然，只有当替代数据比原始数据短时，这才有助于压缩。（反例是在金凯瑞电影Yes Man中，当Norm使用“Car “作为 Carl 的简称。Carl 指出 Carl 已经很短了。）霍夫曼编码算法进行频率分析，并使用最短的替代品来替换最常出现的数据序列。

Deflate 将这些结合起来，以便可以在 LZ77 反向引用上使用霍夫曼代码。各种 DEFLATE/ZLIB 压缩器上的策略选项只是告诉库 Huffman 与 LZ77 的权重是多少。

FILTERED 通常意味着 LZ77 匹配在长度为 5 时停止。因此，当文档显示
<块引用>
对过滤器（或预测器）生成的数据使用（过滤），...过滤的数据主要由具有一定随机分布的小值组成。

（来自 zlib 手册页）
...我读到的代码表明它进行 LZ77 匹配，但最多只能匹配 5 个或更少字节的序列。我猜这就是文档所说的“小值”的意思。但文档中没有提到数字 5，因此不能保证该数字不会从一个版本更改为另一个版本，或者从 ZLIB/DEFLATE 的一种实现更改为另一种实现（例如 C 版本和 Java 版本）。

HUFFMAN_ONLY 表示仅根据频率分析进行替换代码。 HUFFMAN_ONLY 非常非常快，但对于大多数数据来说压缩效果不是很好。除非您的字节值范围非常小（例如，如果实际数据流中的字节仅采用可能的 255 个值中的 20 个之一），或者以牺牲大小为代价对压缩速度有极端要求，则 HUFFMAN_ONLY 不会是你想要的。
DEFAULT 以作者预期的方式将两者结合起来，这对于大多数应用程序来说是最有效的。

据我所知，在 DEFLATE 中没有办法只做 LZ77。没有 LZ77_ONLY 策略。但当然，您可以构建或获取自己的 LZ77 编码器，这将是“仅限 LZ77”。

请记住，该策略永远不会影响压缩的正确性；它只影响它的操作以及它的性能，无论是速度还是大小。

还有其他方法可以调整压缩器。一是设置LZ77滑动窗口的大小。在 C 库中，这是通过“窗口位”选项指定的。如果您了解 LZ77，那么您就会知道较小的窗口意味着更少的回溯，这意味着更快的压缩，但代价是丢失一些匹配项。这通常是压缩时转动的更有效的旋钮。

底线是，对于 80% 的情况，您不关心调整策略。您可能有兴趣摆弄窗口位，只是为了看看会发生什么。但只有当您完成了应用程序中需要执行的所有其他操作后，才可以执行此操作。

参考资料：
DEFLATE 的工作原理，作者：Antaeus Feldspar

The strategy options mentioned in the Java Deflater originated in the zlib (C) implementation of ZLIB and (RFC 1950) and DEFLATE (1951), I believe. They are present in virtually all compression libraries that implement DEFLATE.

To understand what they mean, you need to know a little about DEFLATE. The compression algorithm combines LZ77 and Huffman coding. The basics are:

LZ77 compression works by finding sequences of data that are repeated. Implementations typically use a "sliding window" of between 1k and 32k, to keep track of data that went before. For any repeats in the original data, instead of inserting the repeated data in the output, the LZ77 compression inserts a "back-reference". Imagine the back reference saying "here, insert the same data you saw 8293 bytes ago, for 17 bytes". The back-ref is encoded as this pair of numbers: a length - in this case 17 - and a distance (or offset) - in this case, 8293.
Huffman coding substitutes codes for the actual data. When the data says X, the Huffman code says Y. This obviously helps compression only if the substitute is shorter than the original. (a counter-example is in the Jim Carrey movie Yes Man, when Norm uses "Car" for a shortname for Carl. Carl points out that Carl is already pretty short.) The Huffman encoding algorithm does a frequency analysis, and uses the shortest substitutes for the data sequences that occur most often.

Deflate combines these, so that one can use Huffman codes on LZ77 back-refs. The strategy options on various DEFLATE/ZLIB compressors just tells the library how much to weight Huffman versus LZ77.

FILTERED usually means the LZ77 matches are stopped at a length of 5. So when the documentation says
Use (Filtered) for data produced by a filter (or predictor), ... Filtered data consists mostly of small values with a somewhat random distribution.

(from the zlib man page)
...my reading of the code says that it does LZ77 matching, but only up to sequences of 5 or fewer bytes. That's what the doc means by "small values" I guess. But the number 5 isn't mentioned in the doc, so there's no guarantee that number won't be changed from rev to rev, or from one implementation of ZLIB/DEFLATE to another (like the C version and the Java version).

HUFFMAN_ONLY says, only do the substitution codes based on frequency analysis. HUFFMAN_ONLY is very very fast, but not very effective in compression for most data. Unless you have a very small range of byte values (for example, if bytes in your actual datastream take one of only 20 of the possible 255 values), or have extreme requirements for speed in compression at the expense of size, HUFFMAN_ONLY will not be what you want.
DEFAULT combines the two in the way the authors expected it would be most effective for most applications.

As far as I know, within DEFLATE there is no way to do only LZ77. There is no LZ77_ONLY strategy. But of course you could build or acquire your own LZ77 encoder and that would be "LZ77 only".

Keep in mind that the strategy never affects correctness of compression; it affects only and operation of it, and the performance of it, either in speed or size.

There are other ways to tweak the compressor. One is to set the size of the LZ77 sliding window. In the C library, this is specified with a "Window bits" option. If you understand LZ77, then you know that a smaller window means less searching back, which means faster compression, at the expense of missing some matches. This is often the more effective knob to turn when compressing.

The bottom line is that, for the 80% case, you don't care to tweak strategy. You might be interested in fiddling with window bits, just to see what happens. But only do that when you've done everything else you need to do in your app.

reference:
How DEFLATE works, by Antaeus Feldspar

回复收藏 0 原文

~没有更多了~