What are good strategies for determining block size in the deflate algorithm?
I'm writing a compression library as a little side project, and I'm far enough along (my library can extract any standard gzip file, as well as produce compliant (but certainly not yet optimal) gzip output) that it's time to figure out a meaningful block termination strategy. Currently, I just cut the blocks off after every 32k of input (the LZ77 window size) because it was convenient and quick to implement -- now I am going back and trying to actually improve compression efficiency.
The Deflate spec has only this to say about it: "The compressor terminates a block when it determines that starting a new block with fresh trees would be useful, or when the block size fills up the compressor's block buffer", which isn't all that helpful.
I sorted through the SharpZipLib code (as I figured it would be the most easily readable open source implementation), and found that it terminates a block every 16k literals of output, ignoring the input. This is easy enough to implement, but it seems like there must be some more targeted approach, especially given the language in the spec: "determines that starting a new block with fresh trees would be useful".
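That counting rule is simple to state; here is a minimal sketch of it in Python (the `BlockEmitter` wrapper and its names are illustrative, not SharpZipLib's actual structure):

```python
BLOCK_LITERAL_LIMIT = 16 * 1024  # SharpZipLib-style threshold on output symbols

class BlockEmitter:
    """Ends the current deflate block after a fixed number of output
    symbols (literals / length-distance pairs), ignoring input statistics."""

    def __init__(self, limit=BLOCK_LITERAL_LIMIT):
        self.limit = limit
        self.count = 0          # symbols emitted in the current block
        self.blocks_ended = 0   # how many blocks have been terminated

    def emit(self, symbol):
        # a real encoder would append `symbol` to the block buffer here
        self.count += 1
        if self.count >= self.limit:
            self.end_block()

    def end_block(self):
        # a real encoder would build Huffman trees for the buffered
        # symbols, write the block, and reset the frequency statistics
        self.blocks_ended += 1
        self.count = 0
```

The point is just that the trigger is a counter on the output side; nothing about the data's statistics ever enters the decision.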
So does anyone have any ideas for new strategies, or examples of existing ones?
Thanks in advance!
As a suggestion to get you going.
Use a speculative look-ahead, with a buffer large enough to show whether superior compression would be worth the change.
This changes the streaming behaviour (more data must be read before any output is produced) and significantly complicates operations like flush. It also adds considerable extra work to the compression itself.
In the general case it would be possible to ensure optimal output simply by branching at each point where a new block could start, taking both branches and recursing as necessary until all routes have been explored. The path with the best behaviour wins. This is unlikely to be feasible for non-trivial input sizes, since the choice of when to start a new block is so open.
Simply restricting it to a minimum of 8K output literals, but preventing more than 32K literals in a block, would give a relatively tractable basis for trying speculative algorithms. Call 8K a sub-block.
The simplest such scheme would be to compare, at each sub-block boundary, the cost of continuing the current block against the cost of splitting there, where OVERHEAD is some constant accounting for the cost of switching over blocks.
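A rough Python sketch of that comparison — `est_bits` is a zeroth-order entropy estimate standing in for real Huffman-coded sizes, and `SUB_BLOCK`/`MAX_BLOCK` follow the 8K/32K limits above; all values are assumptions for illustration:

```python
import math
from collections import Counter

SUB_BLOCK = 8 * 1024   # minimum literals before considering a split
MAX_BLOCK = 32 * 1024  # hard cap on literals per block
OVERHEAD = 200         # assumed bit cost of emitting fresh trees

def est_bits(symbols):
    """Zeroth-order entropy estimate of the bits to code `symbols`
    with trees tuned to their own statistics."""
    n = len(symbols)
    return sum(-c * math.log2(c / n) for c in Counter(symbols).values()) if n else 0.0

def choose_blocks(symbols):
    """Greedy speculative splitter: at each SUB_BLOCK boundary, compare
    keeping everything in one block against splitting at the previous
    boundary, and flush whenever splitting looks cheaper (or the block
    would exceed MAX_BLOCK)."""
    blocks = []
    start = 0
    pos = min(SUB_BLOCK, len(symbols))
    while pos < len(symbols):
        nxt = min(pos + SUB_BLOCK, len(symbols))
        keep = est_bits(symbols[start:nxt])
        change = est_bits(symbols[start:pos]) + OVERHEAD + est_bits(symbols[pos:nxt])
        if change < keep or pos - start >= MAX_BLOCK:
            blocks.append((start, pos))
            start = pos
        pos = nxt
    blocks.append((start, len(symbols)))
    return blocks
```

Unlike the exhaustive search, this only ever looks one sub-block ahead, so the extra buffering stays bounded.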
This is rough, and could likely be improved, but it is a start for analysis if nothing else. Instrument the code for information about what causes a switch, and use that to determine good heuristics for when a change might be beneficial (perhaps that the compression ratio has dropped significantly).
This could lead to specChange being built only when the heuristic considers it reasonable. If the heuristic turns out to be a strong indicator, you could then do away with the speculative nature and simply decide to swap at that point no matter what.
Hmm, I like the idea of some heuristic analysis to try to come up with some "rules" for when ending the block might be beneficial. I will look into your suggested approach tonight, and see what I could do with it.
In the meantime, it occurs to me that in order to make a fully informed choice on the issue, I need a better mental picture of the pros and cons of block-size decisions. Right away I can see that smaller blocks allow a potentially better-targeted symbol alphabet -- at the cost of increased overhead from defining trees more often. Larger blocks counter their more general symbol alphabet with efficiencies of scale (only one tree to store and decode for lots of encoded data).
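That trade-off can be put into rough numbers with an entropy-bound estimate (illustrative only -- real Huffman codes and deflate's actual tree-header format will differ):

```python
import math
from collections import Counter

def entropy_bits(data):
    """Ideal (entropy-bound) bits for coding `data` with a code tuned to it."""
    n = len(data)
    return sum(-c * math.log2(c / n) for c in Counter(data).values())

# Two halves drawing from disjoint 8-symbol alphabets: one block must carry
# a general 16-symbol alphabet; two blocks each carry a tight 8-symbol one.
half_a = [i % 8 for i in range(4096)]
half_b = [100 + i % 8 for i in range(4096)]

one_block = entropy_bits(half_a + half_b)                 # 8192 symbols at 4 bits
two_blocks = entropy_bits(half_a) + entropy_bits(half_b)  # 2 x 4096 symbols at 3 bits
saving = one_block - two_blocks  # the budget a fresh tree header may spend
```

In this contrived case the split saves 8192 bits, so a fresh tree pays for itself as long as its header costs under about 1 KB.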
Off the top of my head, it's not apparent whether the relative distribution of literal codes vs. length-distance codes would have a specific impact on optimal block size. Good food for thought, though.