What are good strategies for determining block size in the deflate algorithm?
I'm writing a compression library as a little side project, and I'm far enough along (my library can extract any standard gzip file, as well as produce compliant (but certainly not yet optimal) gzip output) that it's time to figure out a meaningful block termination strategy. Currently, I just cut the blocks off after every 32k of input (the LZ77 window size) because it was convenient and quick to implement -- now I am going back and trying to actually improve compression efficiency.
The Deflate spec has only this to say about it: "The compressor terminates a block when it determines that starting a new block with fresh trees would be useful, or when the block size fills up the compressor's block buffer", which isn't all that helpful.
I sorted through the SharpZipLib code (as I figured it would be the most easily readable open source implementation), and found that it terminates a block every 16k literals of output, ignoring the input. This is easy enough to implement, but it seems like there must be some more targeted approach, especially given the language in the spec: "determines that starting a new block with fresh trees would be useful".
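That counting rule is simple to state; here is a minimal sketch of it in Python (the `BlockEmitter` wrapper and its names are illustrative, not SharpZipLib's actual structure):

```python
BLOCK_LITERAL_LIMIT = 16 * 1024  # SharpZipLib-style threshold on output symbols

class BlockEmitter:
    """Ends the current deflate block after a fixed number of output
    symbols (literals / length-distance pairs), ignoring input statistics."""

    def __init__(self, limit=BLOCK_LITERAL_LIMIT):
        self.limit = limit
        self.count = 0          # symbols emitted in the current block
        self.blocks_ended = 0   # how many blocks have been terminated

    def emit(self, symbol):
        # a real encoder would append `symbol` to the block buffer here
        self.count += 1
        if self.count >= self.limit:
            self.end_block()

    def end_block(self):
        # a real encoder would build Huffman trees for the buffered
        # symbols, write the block, and reset the frequency statistics
        self.blocks_ended += 1
        self.count = 0
```

The point is just that the trigger is a counter on the output side; nothing about the data's statistics ever enters the decision.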
So does anyone have any ideas for new strategies, or examples of existing ones?
Thanks in advance!
As a suggestion to get you going.
Use a speculative look-ahead, with a buffer large enough to show whether superior compression would be worth the change.
This changes the streaming behaviour (more data must be read before any output is produced) and significantly complicates operations like flush. It also adds considerable extra work to the compression itself.
In the general case it would be possible to ensure optimal output simply by branching at each point where a new block could start, taking both branches and recursing as necessary until all routes have been explored. The path with the best behaviour wins. This is unlikely to be feasible for non-trivial input sizes, since the choice of when to start a new block is so open.
Simply restricting it to a minimum of 8K output literals, but preventing more than 32K literals in a block, would give a relatively tractable basis for trying speculative algorithms. Call 8K a sub-block.
The simplest such scheme would be to compare, at each sub-block boundary, the cost of continuing the current block against the cost of splitting there, where OVERHEAD is some constant accounting for the cost of switching over blocks.
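A rough Python sketch of that comparison — `est_bits` is a zeroth-order entropy estimate standing in for real Huffman-coded sizes, and `SUB_BLOCK`/`MAX_BLOCK` follow the 8K/32K limits above; all values are assumptions for illustration:

```python
import math
from collections import Counter

SUB_BLOCK = 8 * 1024   # minimum literals before considering a split
MAX_BLOCK = 32 * 1024  # hard cap on literals per block
OVERHEAD = 200         # assumed bit cost of emitting fresh trees

def est_bits(symbols):
    """Zeroth-order entropy estimate of the bits to code `symbols`
    with trees tuned to their own statistics."""
    n = len(symbols)
    return sum(-c * math.log2(c / n) for c in Counter(symbols).values()) if n else 0.0

def choose_blocks(symbols):
    """Greedy speculative splitter: at each SUB_BLOCK boundary, compare
    keeping everything in one block against splitting at the previous
    boundary, and flush whenever splitting looks cheaper (or the block
    would exceed MAX_BLOCK)."""
    blocks = []
    start = 0
    pos = min(SUB_BLOCK, len(symbols))
    while pos < len(symbols):
        nxt = min(pos + SUB_BLOCK, len(symbols))
        keep = est_bits(symbols[start:nxt])
        change = est_bits(symbols[start:pos]) + OVERHEAD + est_bits(symbols[pos:nxt])
        if change < keep or pos - start >= MAX_BLOCK:
            blocks.append((start, pos))
            start = pos
        pos = nxt
    blocks.append((start, len(symbols)))
    return blocks
```

Unlike the exhaustive search, this only ever looks one sub-block ahead, so the extra buffering stays bounded.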
This is rough, and could likely be improved, but it is a start for analysis if nothing else. Instrument the code for information about what causes a switch, and use that to determine good heuristics for when a change might be beneficial (perhaps that the compression ratio has dropped significantly).
This could lead to specChange being built only when the heuristic considers it reasonable. If the heuristic turns out to be a strong indicator, you could then do away with the speculative nature and simply decide to swap at that point no matter what.
Hmm, I like the idea of some heuristic analysis to try to come up with some "rules" for when ending the block might be beneficial. I will look into your suggested approach tonight, and see what I could do with it.
In the meantime, it occurs to me that in order to make a fully informed choice on the issue, I need a better mental picture of the pros and cons of block-size decisions. Right away I can see that smaller blocks allow a potentially better-targeted symbol alphabet -- at the cost of increased overhead from defining trees more often. Larger blocks counter their more general symbol alphabet with efficiencies of scale (only one tree to store and decode for lots of encoded data).
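That trade-off can be put into rough numbers with an entropy-bound estimate (illustrative only -- real Huffman codes and deflate's actual tree-header format will differ):

```python
import math
from collections import Counter

def entropy_bits(data):
    """Ideal (entropy-bound) bits for coding `data` with a code tuned to it."""
    n = len(data)
    return sum(-c * math.log2(c / n) for c in Counter(data).values())

# Two halves drawing from disjoint 8-symbol alphabets: one block must carry
# a general 16-symbol alphabet; two blocks each carry a tight 8-symbol one.
half_a = [i % 8 for i in range(4096)]
half_b = [100 + i % 8 for i in range(4096)]

one_block = entropy_bits(half_a + half_b)                 # 8192 symbols at 4 bits
two_blocks = entropy_bits(half_a) + entropy_bits(half_b)  # 2 x 4096 symbols at 3 bits
saving = one_block - two_blocks  # the budget a fresh tree header may spend
```

In this contrived case the split saves 8192 bits, so a fresh tree pays for itself as long as its header costs under about 1 KB.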
Off the top of my head, it's not apparent whether the relative distribution of literal codes vs. length-distance codes would have a specific impact on optimal block size. Good food for thought, though.