How to decode Huffman code quickly?

Posted on 2024-08-21 12:17:25

I have implemented a simple compressor on Windows that uses pure Huffman coding, but I do not know how to decode the compressed file quickly. My naive algorithm is:

Enumerate all the Huffman codes in the code table and compare each one with the bits in the compressed file. The result is horrible: decompressing a 3 MB file takes 6 hours.

Could you suggest a much more efficient algorithm? Should I use a hash table or something else?

Update:
I have implemented the decoder with a state table, based on my friend Lin's advice. I think this method should be better than traversing the Huffman tree: it decodes 3 MB within 6 s.

Thanks.

Comments (6)

冰葑 2024-08-28 12:17:25

One way to optimise the binary-tree approach is to use a lookup table. You index the table directly with the next bits of the encoded input, taking as many bits as the longest code has.

Since most codes are shorter than that maximum width, each one appears at multiple locations in the table: one location for each combination of the unused trailing bits. Each entry records the decoded output and how many bits to discard from the input.
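As a rough illustration of the single-table idea, here is a minimal Rust sketch. It assumes the codes are given as (symbol, code value, code length) triples, that bits are packed most significant bit first, and that the number of symbols to decode is known in advance. The names `Entry`, `build_table` and `decode` are made up for this example and do not come from any library.

```rust
/// One table slot: the decoded symbol and the length of its code in bits.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Entry {
    symbol: u8,
    len: u8, // 0 marks a slot that no code maps to
}

/// Build a table indexed by the next `max_len` input bits (MSB first).
/// `codes` holds hypothetical (symbol, code value, code length) triples.
fn build_table(codes: &[(u8, u32, u8)], max_len: u8) -> Vec<Entry> {
    let mut table = vec![Entry { symbol: 0, len: 0 }; 1usize << max_len];
    for &(symbol, code, len) in codes {
        // A short code fills one slot per combination of the unused bits.
        let unused = max_len - len;
        let first = (code as usize) << unused;
        for slot in first..first + (1usize << unused) {
            table[slot] = Entry { symbol, len };
        }
    }
    table
}

/// Decode `count` symbols from `data`, reading bits MSB first.
fn decode(data: &[u8], count: usize, table: &[Entry], max_len: u8) -> Vec<u8> {
    let max_len = max_len as u32;
    let mask = (1usize << max_len) - 1;
    let mut out = Vec::with_capacity(count);
    let mut bitbuf: u64 = 0; // unread bits sit in the low `valid` bits
    let mut valid: u32 = 0;
    let mut pos = 0;
    while out.len() < count {
        // Refill a byte at a time; past the end of input, pad with zero bits.
        while valid < max_len {
            let byte = if pos < data.len() { data[pos] } else { 0 };
            pos += 1;
            bitbuf = (bitbuf << 8) | byte as u64;
            valid += 8;
        }
        let index = ((bitbuf >> (valid - max_len)) as usize) & mask;
        let entry = table[index];
        if entry.len == 0 {
            break; // bit pattern matches no code
        }
        out.push(entry.symbol);
        valid -= entry.len as u32; // discard only the bits of this code
    }
    out
}
```

With the codes a=0, b=10, c=110, d=111 the table has 8 slots, and `a` occupies four of them. Every symbol costs one table read regardless of its code length, which is where the speed-up over comparing against each code comes from.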

If the longest code is so long that a single table is impractical, a compromise is to use a tree of smaller lookup tables, each indexed by a fixed number of bits. For example, you can use a 256-item table to handle one byte. If the input code is longer than 8 bits, the table entry indicates that decoding is incomplete and directs you to a table that handles the next up-to-8 bits. Larger tables trade memory for speed - 256 items is probably too small.
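The tree-of-tables compromise could be sketched in Rust roughly as follows. This is an illustration under assumptions, not a reference implementation: codes are again hypothetical (symbol, code value, code length) triples packed MSB first, every table has the same index width, and the names `Slot`, `build_tables` and `decode` are invented here.

```rust
/// What a table slot can hold.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Slot {
    Empty,                          // no code reaches this slot
    Leaf { symbol: u8, bits: u8 },  // code ends here after `bits` more bits
    Sub { table: usize },           // code continues in tables[table]
}

/// Build a tree of tables, each indexed by `width` bits. tables[0] is the root.
fn build_tables(codes: &[(u8, u32, u32)], width: u32) -> Vec<Vec<Slot>> {
    let size = 1usize << width;
    let mut tables = vec![vec![Slot::Empty; size]];
    for &(symbol, code, len) in codes {
        let mut t = 0;      // table currently being filled
        let mut rest = len; // code bits not yet placed
        while rest > width {
            // The next `width` bits of the code select (or create) a sub-table.
            let index = ((code >> (rest - width)) as usize) & (size - 1);
            let slot = tables[t][index];
            t = match slot {
                Slot::Sub { table } => table,
                _ => {
                    tables.push(vec![Slot::Empty; size]);
                    let new = tables.len() - 1;
                    tables[t][index] = Slot::Sub { table: new };
                    new
                }
            };
            rest -= width;
        }
        // The last `rest` bits fill one slot per combination of unused bits.
        let unused = width - rest;
        let first = ((code as usize) << unused) & (size - 1);
        for index in first..first + (1usize << unused) {
            tables[t][index] = Slot::Leaf { symbol, bits: rest as u8 };
        }
    }
    tables
}

/// Decode `count` symbols from `data`, reading bits MSB first.
fn decode(data: &[u8], count: usize, tables: &[Vec<Slot>], width: u32) -> Vec<u8> {
    let mask = (1u64 << width) - 1;
    let (mut bitbuf, mut valid, mut pos) = (0u64, 0u32, 0usize);
    let mut out = Vec::with_capacity(count);
    let mut t = 0;
    while out.len() < count {
        while valid < width {
            let byte = if pos < data.len() { data[pos] } else { 0 };
            pos += 1;
            bitbuf = (bitbuf << 8) | byte as u64;
            valid += 8;
        }
        let index = ((bitbuf >> (valid - width)) & mask) as usize;
        match tables[t][index] {
            Slot::Leaf { symbol, bits } => {
                out.push(symbol);
                valid -= bits as u32;
                t = 0; // next symbol starts at the root table again
            }
            Slot::Sub { table } => {
                valid -= width; // all `width` bits belong to this code
                t = table;
            }
            Slot::Empty => break, // bit pattern matches no code
        }
    }
    out
}
```

Calling `build_tables(&codes, 8)` gives the 256-item tables from the paragraph above. A tiny width such as 2 is handy for checking the logic by hand.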

I believe this general approach is called "prefix tables", and it is what BobMcGee's quoted code is doing. A likely difference is that some compression algorithms require the prefix table to be updated during decompression - this is not needed for simple Huffman. IIRC, I first saw it in a book about bitmapped graphics file formats, which included GIF, some time before the patent panic.

It should be easy to precalculate a full lookup table, a hashtable equivalent, or a tree of small tables from a binary tree model. The binary tree is still the key representation (mental model) of how the code works - the lookup table is just an optimised way to implement it.

梦里泪两行 2024-08-28 12:17:25

Why not take a look at how the GZIP source does it, specifically the Huffman decompression code in unpack.c? It does exactly what you are doing, except much, much faster.

From what I can tell, it's using a lookup array and shift/mask operations operating on whole words to run faster. Pretty dense code though.
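To make the shift/mask idea easier to follow before reading the dense C source, here is a small Rust sketch of a bit buffer in the spirit of the `look_bits` and `skip_bits` macros from unpack.c. The `BitReader` type is made up for this illustration. It assumes MSB-first bit order and returns zero bits past the end of the input, which the real code does not do.

```rust
/// Bit reader that keeps unread bits in a machine word.
struct BitReader<'a> {
    data: &'a [u8],
    pos: usize,
    bitbuf: u32, // bits are added at the low end and read from the high end
    valid: u32,  // number of unread bits in bitbuf
}

impl<'a> BitReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        BitReader { data, pos: 0, bitbuf: 0, valid: 0 }
    }

    /// Return the next `bits` bits (1 to 24) without consuming them.
    fn look_bits(&mut self, bits: u32) -> u32 {
        while self.valid < bits {
            // Past the end of the input we shift in zero bits (an assumption
            // of this sketch).
            let byte = self.data.get(self.pos).copied().unwrap_or(0);
            self.pos += 1;
            self.bitbuf = (self.bitbuf << 8) | byte as u32;
            self.valid += 8;
        }
        // Shift the wanted bits down to the bottom, then mask off older bits.
        (self.bitbuf >> (self.valid - bits)) & ((1u32 << bits) - 1)
    }

    /// Consume `bits` bits that were previously peeked at.
    fn skip_bits(&mut self, bits: u32) {
        self.valid -= bits;
    }
}
```

Peeking and skipping are separate steps because the decoder only learns the real code length after the table lookup: it peeks a fixed number of bits, looks up the length, and then skips just that many.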

EDIT: here is the complete source

/* unpack.c -- decompress files in pack format.
 * Copyright (C) 1992-1993 Jean-loup Gailly
 * This is free software; you can redistribute it and/or modify it under the
 * terms of the GNU General Public License, see the file COPYING.
 */

#ifdef RCSID
static char rcsid[] = "$Id: unpack.c,v 1.4 1993/06/11 19:25:36 jloup Exp $";
#endif

#include "tailor.h"
#include "gzip.h"
#include "crypt.h"

#define MIN(a,b) ((a) <= (b) ? (a) : (b))
/* The arguments must not have side effects. */

#define MAX_BITLEN 25
/* Maximum length of Huffman codes. (Minor modifications to the code
 * would be needed to support 32 bits codes, but pack never generates
 * more than 24 bits anyway.)
 */

#define LITERALS 256
/* Number of literals, excluding the End of Block (EOB) code */

#define MAX_PEEK 12
/* Maximum number of 'peek' bits used to optimize traversal of the
 * Huffman tree.
 */

local ulg orig_len;       /* original uncompressed length */
local int max_len;        /* maximum bit length of Huffman codes */

local uch literal[LITERALS];
/* The literal bytes present in the Huffman tree. The EOB code is not
 * represented.
 */

local int lit_base[MAX_BITLEN+1];
/* All literals of a given bit length are contiguous in literal[] and
 * have contiguous codes. literal[code+lit_base[len]] is the literal
 * for a code of len bits.
 */

local int leaves [MAX_BITLEN+1]; /* Number of leaves for each bit length */
local int parents[MAX_BITLEN+1]; /* Number of parents for each bit length */

local int peek_bits; /* Number of peek bits currently used */

/* local uch prefix_len[1 << MAX_PEEK]; */
#define prefix_len outbuf
/* For each bit pattern b of peek_bits bits, prefix_len[b] is the length
 * of the Huffman code starting with a prefix of b (upper bits), or 0
 * if all codes of prefix b have more than peek_bits bits. It is not
 * necessary to have a huge table (large MAX_PEEK) because most of the
 * codes encountered in the input stream are short codes (by construction).
 * So for most codes a single lookup will be necessary.
 */
#if (1<<MAX_PEEK) > OUTBUFSIZ
    error cannot overlay prefix_len and outbuf
#endif

local ulg bitbuf;
/* Bits are added on the low part of bitbuf and read from the high part. */

local int valid;                  /* number of valid bits in bitbuf */
/* all bits above the last valid bit are always zero */

/* Set code to the next 'bits' input bits without skipping them. code
 * must be the name of a simple variable and bits must not have side effects.
 * IN assertions: bits <= 25 (so that we still have room for an extra byte
 * when valid is only 24), and mask = (1<<bits)-1.
 */
#define look_bits(code,bits,mask) \
{ \
  while (valid < (bits)) bitbuf = (bitbuf<<8) | (ulg)get_byte(), valid += 8; \
  code = (bitbuf >> (valid-(bits))) & (mask); \
}

/* Skip the given number of bits (after having peeked at them): */
#define skip_bits(bits)  (valid -= (bits))

#define clear_bitbuf() (valid = 0, bitbuf = 0)

/* Local functions */

local void read_tree  OF((void));
local void build_tree OF((void));

/* ===========================================================================
 * Read the Huffman tree.
 */
local void read_tree()
{
    int len;  /* bit length */
    int base; /* base offset for a sequence of leaves */
    int n;

    /* Read the original input size, MSB first */
    orig_len = 0;
    for (n = 1; n <= 4; n++) orig_len = (orig_len << 8) | (ulg)get_byte();

    max_len = (int)get_byte(); /* maximum bit length of Huffman codes */
    if (max_len > MAX_BITLEN) {
        error("invalid compressed data -- Huffman code > 32 bits");
    }

    /* Get the number of leaves at each bit length */
    n = 0;
    for (len = 1; len <= max_len; len++) {
        leaves[len] = (int)get_byte();
        n += leaves[len];
    }
    if (n > LITERALS) {
        error("too many leaves in Huffman tree");
    }
    Trace((stderr, "orig_len %ld, max_len %d, leaves %d\n",
           orig_len, max_len, n));
    /* There are at least 2 and at most 256 leaves of length max_len.
     * (Pack arbitrarily rejects empty files and files consisting of
     * a single byte even repeated.) To fit the last leaf count in a
     * byte, it is offset by 2. However, the last literal is the EOB
     * code, and is not transmitted explicitly in the tree, so we must
     * adjust here by one only.
     */
    leaves[max_len]++;

    /* Now read the leaves themselves */
    base = 0;
    for (len = 1; len <= max_len; len++) {
        /* Remember where the literals of this length start in literal[] : */
        lit_base[len] = base;
        /* And read the literals: */
        for (n = leaves[len]; n > 0; n--) {
            literal[base++] = (uch)get_byte();
        }
    }
    leaves[max_len]++; /* Now include the EOB code in the Huffman tree */
}

/* ===========================================================================
 * Build the Huffman tree and the prefix table.
 */
local void build_tree()
{
    int nodes = 0; /* number of nodes (parents+leaves) at current bit length */
    int len;       /* current bit length */
    uch *prefixp;  /* pointer in prefix_len */

    for (len = max_len; len >= 1; len--) {
        /* The number of parent nodes at this level is half the total
         * number of nodes at parent level:
         */
        nodes >>= 1;
        parents[len] = nodes;
        /* Update lit_base by the appropriate bias to skip the parent nodes
         * (which are not represented in the literal array):
         */
        lit_base[len] -= nodes;
        /* Restore nodes to be parents+leaves: */
        nodes += leaves[len];
    }
    /* Construct the prefix table, from shortest leaves to longest ones.
     * The shortest code is all ones, so we start at the end of the table.
     */
    peek_bits = MIN(max_len, MAX_PEEK);
    prefixp = &prefix_len[1<<peek_bits];
    for (len = 1; len <= peek_bits; len++) {
        int prefixes = leaves[len] << (peek_bits-len); /* may be 0 */
        while (prefixes--) *--prefixp = (uch)len;
    }
    /* The length of all other codes is unknown: */
    while (prefixp > prefix_len) *--prefixp = 0;
}

/* ===========================================================================
 * Unpack in to out.  This routine does not support the old pack format
 * with magic header \037\037.
 *
 * IN assertions: the buffer inbuf contains already the beginning of
 *   the compressed data, from offsets inptr to insize-1 included.
 *   The magic header has already been checked. The output buffer is cleared.
 */
int unpack(in, out)
    int in, out;            /* input and output file descriptors */
{
    int len;                /* Bit length of current code */
    unsigned eob;           /* End Of Block code */
    register unsigned peek; /* lookahead bits */
    unsigned peek_mask;     /* Mask for peek_bits bits */

    ifd = in;
    ofd = out;

    read_tree();     /* Read the Huffman tree */
    build_tree();    /* Build the prefix table */
    clear_bitbuf();  /* Initialize bit input */
    peek_mask = (1<<peek_bits)-1;

    /* The eob code is the largest code among all leaves of maximal length: */
    eob = leaves[max_len]-1;
    Trace((stderr, "eob %d %x\n", max_len, eob));

    /* Decode the input data: */
    for (;;) {
        /* Since eob is the longest code and not shorter than max_len,
         * we can peek at max_len bits without having the risk of reading
         * beyond the end of file.
         */
        look_bits(peek, peek_bits, peek_mask);
        len = prefix_len[peek];
        if (len > 0) {
            peek >>= peek_bits - len; /* discard the extra bits */
        } else {
            /* Code of more than peek_bits bits, we must traverse the tree */
            ulg mask = peek_mask;
            len = peek_bits;
            do {
                len++, mask = (mask<<1)+1;
                look_bits(peek, len, mask);
            } while (peek < (unsigned)parents[len]);
            /* loop as long as peek is a parent node */
        }
        /* At this point, peek is the next complete code, of len bits */
        if (peek == eob && len == max_len) break; /* end of file? */
        put_ubyte(literal[peek+lit_base[len]]);
        Tracev((stderr,"%02d %04x %c\n", len, peek,
                literal[peek+lit_base[len]]));
        skip_bits(len);
    } /* for (;;) */

    flush_window();
    Trace((stderr, "bytes_out %ld\n", bytes_out));
    if (orig_len != (ulg)bytes_out) {
        error("invalid compressed data--length error");
    }
    return OK;
}

多情出卖 2024-08-28 12:17:25

The typical way to decompress a Huffman code is to use a binary tree. You insert your codes into the tree so that each bit in a code represents a branch either to the left (0) or to the right (1), with the decoded bytes (or whatever values you have) in the leaves.

Decoding is then just a matter of reading bits from the coded content and walking the tree for each bit. When you reach a leaf, emit its decoded value, go back to the root, and keep reading until the input is exhausted.
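A minimal Rust sketch of this tree walk is shown below. It assumes the codes arrive as hypothetical (symbol, code value, code length) triples, that bits are read MSB first, and that the number of symbols is known so trailing padding bits can be ignored. `Node`, `build_tree` and `decode` are names invented for the example.

```rust
/// Tree node stored in a Vec; node 0 is the root.
struct Node {
    child: [Option<usize>; 2], // child[0]: branch for bit 0, child[1]: for bit 1
    symbol: Option<u8>,        // Some(..) only in leaves
}

/// Insert every code into the tree, creating nodes along its bit path.
fn build_tree(codes: &[(u8, u32, u32)]) -> Vec<Node> {
    let mut nodes = vec![Node { child: [None, None], symbol: None }];
    for &(symbol, code, len) in codes {
        let mut cur = 0;
        for i in (0..len).rev() {
            let bit = ((code >> i) & 1) as usize;
            let link = nodes[cur].child[bit];
            cur = match link {
                Some(next) => next,
                None => {
                    nodes.push(Node { child: [None, None], symbol: None });
                    let next = nodes.len() - 1;
                    nodes[cur].child[bit] = Some(next);
                    next
                }
            };
        }
        nodes[cur].symbol = Some(symbol);
    }
    nodes
}

/// Walk the tree one input bit at a time and emit a symbol at each leaf.
fn decode(data: &[u8], count: usize, nodes: &[Node]) -> Vec<u8> {
    let mut out = Vec::with_capacity(count);
    let mut cur = 0;
    'outer: for &byte in data {
        for i in (0..8u32).rev() {
            let bit = ((byte >> i) & 1) as usize;
            cur = match nodes[cur].child[bit] {
                Some(next) => next,
                None => break 'outer, // bit pattern matches no code
            };
            if let Some(symbol) = nodes[cur].symbol {
                out.push(symbol);
                if out.len() == count {
                    break 'outer;
                }
                cur = 0; // restart at the root for the next code
            }
        }
    }
    out
}
```

The cost is one node visit per input bit, which is far cheaper than comparing the input against every code in the table, but slower than the table-driven methods in the other answers.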

Update: this page describes the technique, and has fancy graphics.

南冥有猫 2024-08-28 12:17:25

You can perform a kind of batch lookup on top of the usual Huffman tree lookup:

  1. Choose a bit depth (call it depth n); this is a trade-off between speed, memory, and the time needed to construct the tables;
  2. Build a lookup table for all 2^n bit strings of length n. Each entry may encode several complete tokens; there will commonly also be some bits left over that are only a prefix of a Huffman code: for each such prefix, make a link to a further lookup table for that prefix;
  3. Build the further lookup tables. The total number of tables is at most one less than the number of entries coded in the Huffman tree.

Choosing a depth that is a multiple of four, e.g., depth 8, is a good fit for bit shifting operations.

Postscript: this differs from the idea in potatoswatter's comment on unwind's answer and from Steve314's answer in using multiple tables: this means that all of the n-bit lookup is put to use, so it should be faster, but it makes table construction and lookup significantly trickier and will consume much more space for a given depth.
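One possible reading of the three steps above, sketched in Rust with depth n = 8: each internal tree node is a decoder state, and for every (state, input byte) pair the table stores the symbols completed by that byte plus the state to continue from. The types `Link` and `Step` and the function names are invented for this illustration. Codes are assumed to be (symbol, code value, code length) triples packed MSB first, with at least two codes and a known output length.

```rust
/// Where a branch of the Huffman tree leads.
#[derive(Clone, Copy)]
enum Link {
    Missing,     // no code uses this branch
    Node(usize), // another internal node
    Leaf(u8),    // a complete code for this symbol
}

/// Result of feeding one whole input byte to the decoder in a given state.
struct Step {
    emit: Vec<u8>, // symbols completed while consuming the byte
    next: usize,   // state (internal node) for the following byte
    valid: bool,   // false if the bits ran into a missing branch
}

/// Internal nodes only; node 0 is the root.
fn build_tree(codes: &[(u8, u32, u32)]) -> Vec<[Link; 2]> {
    let mut nodes = vec![[Link::Missing; 2]];
    for &(symbol, code, len) in codes {
        let mut cur = 0;
        for i in (0..len).rev() {
            let bit = ((code >> i) & 1) as usize;
            if i == 0 {
                nodes[cur][bit] = Link::Leaf(symbol);
            } else {
                let link = nodes[cur][bit];
                cur = match link {
                    Link::Node(next) => next,
                    _ => {
                        nodes.push([Link::Missing; 2]);
                        let next = nodes.len() - 1;
                        nodes[cur][bit] = Link::Node(next);
                        next
                    }
                };
            }
        }
    }
    nodes
}

/// steps[state][byte]: precomputed walk of 8 bits starting at `state`.
fn build_steps(nodes: &[[Link; 2]]) -> Vec<Vec<Step>> {
    (0..nodes.len())
        .map(|state| {
            (0..256u32)
                .map(|byte| {
                    let mut step = Step { emit: Vec::new(), next: state, valid: true };
                    for i in (0..8u32).rev() {
                        match nodes[step.next][((byte >> i) & 1) as usize] {
                            Link::Node(n) => step.next = n,
                            Link::Leaf(sym) => {
                                step.emit.push(sym);
                                step.next = 0;
                            }
                            Link::Missing => {
                                step.valid = false;
                                break;
                            }
                        }
                    }
                    step
                })
                .collect()
        })
        .collect()
}

/// Decode by consuming one whole byte per table lookup.
fn decode(data: &[u8], count: usize, steps: &[Vec<Step>]) -> Vec<u8> {
    let mut out = Vec::with_capacity(count);
    let mut state = 0;
    for &byte in data {
        let step = &steps[state][byte as usize];
        out.extend_from_slice(&step.emit);
        if !step.valid {
            break;
        }
        state = step.next;
    }
    out.truncate(count); // drop symbols decoded from padding bits
    out
}
```

The number of states equals the number of internal nodes, which for a complete code is one less than the number of symbols, matching the bound in step 3. Storing a `Vec` per entry keeps the sketch short; a tuned version would likely pack the emitted symbols into a flat array.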

揽月 2024-08-28 12:17:25

Why not use the decompress algorithm in the same source module? It appears to be a decent algorithm.

寄居者 2024-08-28 12:17:25

The other answers are right, but here is some Rust code I wrote recently to make the ideas concrete. This is the key routine:

  fn decode( &self, input: &mut InpBitStream ) -> usize
  {
    let mut sym = self.lookup[ input.peek( self.peekbits ) ];
    if sym >= self.ncode
    {
      sym = self.lookup[ sym - self.ncode + ( input.peek( self.maxbits ) >> self.peekbits ) ];
    }  
    input.advance( self.nbits[ sym ] as usize );
    sym
  }

The tricky bit is setting up the lookup table; see BitDecoder::setup_code in this complete RFC 1951 decoder in Rust:

// RFC 1951 inflate ( de-compress ).

pub fn inflate( data: &[u8] ) -> Vec<u8>
{
  let mut inp = InpBitStream::new( &data );
  let mut out = Vec::new();
  let _chk = inp.get_bits( 16 ); // Skip the 2-byte zlib header ( CMF and FLG, RFC 1950 ), which includes the FCHECK bits
  loop
  {
    let last = inp.get_bit();
    let btype = inp.get_bits( 2 );
    match btype
    {
      2 => { do_dyn( &mut inp, &mut out ); }
      1 => { do_fixed( &mut inp, &mut out ); }
      0 => { do_copy( &mut inp, &mut out ); }
      _ => { }
    }
    if last != 0 { break; }
  }  
  out
}

fn do_dyn( inp: &mut InpBitStream, out: &mut Vec<u8> )
{
  let n_lit_code = 257 + inp.get_bits( 5 );
  let n_dist_code = 1 + inp.get_bits( 5 );
  let n_len_code = 4 + inp.get_bits( 4 );

  let mut len = LenDecoder::new( inp, n_len_code );

  let mut lit = BitDecoder::new( n_lit_code );
  len.get_lengths( inp, &mut lit.nbits );
  lit.init(); 

  let mut dist = BitDecoder::new( n_dist_code );
  len.get_lengths( inp, &mut dist.nbits );
  dist.init();

  loop
  {
    let x = lit.decode( inp );
    match x
    {
      0..=255 => { out.push( x as u8 ); }
      256 =>  { break; } 
      _ =>
      {
        let mc = x - 257;
        let length = MATCH_OFF[ mc ] + inp.get_bits( MATCH_EXTRA[ mc ] as usize );
        let dc = dist.decode( inp );
        let distance = DIST_OFF[ dc ] + inp.get_bits( DIST_EXTRA[ dc ] as usize );
        copy( out, distance, length ); 
      }
    }
  }
} // end do_dyn

fn copy( out: &mut Vec<u8>, distance: usize, mut length: usize )
{
  let mut i = out.len() - distance;
  while length > 0
  {
    out.push( out[ i ] );
    i += 1;
    length -= 1;
  }
}

/// Decode length-limited Huffman codes.
struct BitDecoder
{
  ncode: usize,
  nbits: Vec<u8>,
  maxbits: usize,
  peekbits: usize,
  lookup: Vec<usize>
}

impl BitDecoder
{
  fn new( ncode: usize ) -> BitDecoder
  {
    BitDecoder 
    { 
      ncode,
      nbits: vec![0; ncode],
      maxbits: 0,
      peekbits: 0,
      lookup: Vec::new()
    }
  }

  /// The key routine, will be called many times.
  fn decode( &self, input: &mut InpBitStream ) -> usize
  {
    let mut sym = self.lookup[ input.peek( self.peekbits ) ];
    if sym >= self.ncode
    {
      sym = self.lookup[ sym - self.ncode + ( input.peek( self.maxbits ) >> self.peekbits ) ];
    }  
    input.advance( self.nbits[ sym ] as usize );
    sym
  }

  fn init( &mut self )
  {
    let ncode = self.ncode;

    let mut max_bits : usize = 0; 
    for bp in &self.nbits 
    { 
      let bits = *bp as usize;
      if bits > max_bits { max_bits = bits; } 
    }

    self.maxbits = max_bits;
    self.peekbits = if max_bits > 8 { 8 } else { max_bits };
    self.lookup.resize( 1 << self.peekbits, 0 );

    // Code below is from rfc1951 page 7

    let mut bl_count : Vec<usize> = vec![ 0; max_bits + 1 ]; // the number of codes of length N, N >= 1.

    for i in 0..ncode { bl_count[ self.nbits[i] as usize ] += 1; }

    let mut next_code : Vec<usize> = vec![ 0; max_bits + 1 ];
    let mut code = 0; 
    bl_count[0] = 0;

    for i in 0..max_bits
    {
      code = ( code + bl_count[i] ) << 1;
      next_code[ i + 1 ] = code;
    }

    for i in 0..ncode
    {
      let len = self.nbits[ i ] as usize;
      if len != 0
      {
        self.setup_code( i, len, next_code[ len ] );
        next_code[ len ] += 1;
      }
    }
  }

  // Decoding is done using self.lookup ( see decode ). To keep the lookup table small,
  // codes longer than 8 bits are looked up in two peeks.

  fn setup_code( &mut self, sym: usize, len: usize, mut code: usize )
  {
    if len <= self.peekbits
    {
      let diff = self.peekbits - len;
      for i in code << diff .. (code << diff) + (1 << diff)
      {
        // bits are reversed to match InpBitStream::peek
        let r = reverse( i, self.peekbits );
        self.lookup[ r ] = sym;
      }
    } else {
      // Secondary lookup required.
      let peekbits2 = self.maxbits - self.peekbits;

      // Split code into peekbits portion ( key ) and remainder ( code).
      let diff1 = len - self.peekbits;
      let key = code >> diff1;
      code &= ( 1 << diff1 ) - 1;

      // Get the secondary lookup.
      let kr = reverse( key, self.peekbits );
      let mut base = self.lookup[ kr ];
      if base == 0 // Secondary lookup not yet allocated for this key.
      {
        base = self.lookup.len();
        self.lookup.resize( base + ( 1 << peekbits2 ), 0 );
        self.lookup[ kr ] = self.ncode + base;
      } else {
        base -= self.ncode;
      }

      // Set the secondary lookup values.
      let diff = self.maxbits - len;
      for i in code << diff .. (code << diff) + (1<<diff)
      { 
        let r = reverse( i, peekbits2 );
        self.lookup[ base + r ] = sym;
      }
    }    
  }
} // end impl BitDecoder

struct InpBitStream<'a>
{
  data: &'a [u8],
  pos: usize,
  buf: usize,
  got: usize, // Number of bits in buffer.
}

impl <'a> InpBitStream<'a>
{
  fn new( data: &'a [u8] ) -> InpBitStream
  {
    InpBitStream { data, pos: 0, buf: 0, got: 0 } // buffer starts empty so the first bit read comes from data[0]
  } 

  fn peek( &mut self, n: usize ) -> usize
  {
    while self.got < n
    {
      if self.pos < self.data.len() 
      {
        self.buf |= ( self.data[ self.pos ] as usize ) << self.got;
      }
      self.pos += 1;
      self.got += 8;
    }
    self.buf & ( ( 1 << n ) - 1 )
  }

  fn advance( &mut self, n:usize )
  { 
    self.buf >>= n;
    self.got -= n;
  }

  fn get_bit( &mut self ) -> usize
  {
    if self.got == 0 { self.peek( 1 ); }
    let result = self.buf & 1;
    self.advance( 1 );
    result
  }

  fn get_bits( &mut self, n: usize ) -> usize
  { 
    let result = self.peek( n );
    self.advance( n );
    result
  }

  fn get_huff( &mut self, mut n: usize ) -> usize 
  { 
    let mut result = 0; 
    while n > 0
    { 
      result = ( result << 1 ) + self.get_bit(); 
      n -= 1;
    }
    result
  }

  fn clear_bits( &mut self )
  {
    self.got = 0;
  }
} //  end impl InpBitStream

/// Decode code lengths.
struct LenDecoder
{
  plenc: u8, // previous length code ( which can be repeated )
  rep: usize,   // repeat
  bd: BitDecoder,
}

/// Decodes an array of lengths. There are special codes for repeats, and repeats of zeros.
impl LenDecoder
{
  fn new( inp: &mut InpBitStream, n_len_code: usize ) -> LenDecoder
  {
    let mut result = LenDecoder { plenc: 0, rep:0, bd: BitDecoder::new( 19 ) };

    // Read the array of 3-bit code lengths from input.
    for i in 0..n_len_code 
    { 
      result.bd.nbits[ CLEN_ALPHABET[i] as usize ] = inp.get_bits(3) as u8; 
    }
    result.bd.init();
    result
  }

  // Per RFC 1951 page 13, get array of code lengths.
  fn get_lengths( &mut self, inp: &mut InpBitStream, result: &mut Vec<u8> )
  {
    let n = result.len();
    let mut i = 0;
    while self.rep > 0 { result[i] = self.plenc; i += 1; self.rep -= 1; }
    while i < n
    { 
      let lenc = self.bd.decode( inp ) as u8;
      if lenc < 16 
      {
        result[i] = lenc; 
        i += 1; 
        self.plenc = lenc; 
      } else {
        if lenc == 16 { self.rep = 3 + inp.get_bits(2); }
        else if lenc == 17 { self.rep = 3 + inp.get_bits(3); self.plenc=0; }
        else if lenc == 18 { self.rep = 11 + inp.get_bits(7); self.plenc=0; } 
        while i < n && self.rep > 0 { result[i] = self.plenc; i += 1; self.rep -= 1; }
      }
    }
  } // end get_lengths
} // end impl LenDecoder

/// Reverse a string of bits.
pub fn reverse( mut x:usize, mut bits: usize ) -> usize
{ 
  let mut result: usize = 0; 
  while bits > 0
  {
    result = ( result << 1 ) | ( x & 1 ); 
    x >>= 1; 
    bits -= 1;
  } 
  result
} 

fn do_copy( inp: &mut InpBitStream, out: &mut Vec<u8> )
{
  inp.clear_bits(); // Discard any bits in the input buffer
  let mut n = inp.get_bits( 16 );
  let _n1 = inp.get_bits( 16 );
  while n > 0 { out.push( inp.data[ inp.pos ] ); n -= 1; inp.pos += 1; }
}

fn do_fixed( inp: &mut InpBitStream, out: &mut Vec<u8> ) // RFC1951 page 12.
{
  loop
  {
    // 0 to 23 ( 7 bits ) => 256 - 279; 48 - 191 ( 8 bits ) => 0 - 143; 
    // 192 - 199 ( 8 bits ) => 280 - 287; 400..511 ( 9 bits ) => 144 - 255
    let mut x = inp.get_huff( 7 ); 
    if x <= 23 
    { 
      x += 256; 
    } else {
      x = ( x << 1 ) + inp.get_bit();
      if x <= 191 { x -= 48; }
      else if x <= 199 { x += 88; }
      else { x = ( x << 1 ) + inp.get_bit() - 256; }
    }

    match x
    {
      0..=255 => { out.push( x as u8 ); }
      256 => { break; } 
      _ => // 257 <= x && x <= 285 
      { 
        x -= 257;
        let length = MATCH_OFF[x] + inp.get_bits( MATCH_EXTRA[ x ] as usize );
        let dcode = inp.get_huff( 5 );
        let distance = DIST_OFF[dcode] + inp.get_bits( DIST_EXTRA[dcode] as usize );
        copy( out, distance, length );
      }
    }
  }
} // end do_fixed

// RFC 1951 constants.

pub static CLEN_ALPHABET : [u8; 19] = [ 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 ];

pub static MATCH_OFF : [usize; 30] = [ 3,4,5,6, 7,8,9,10, 11,13,15,17, 19,23,27,31, 35,43,51,59, 
  67,83,99,115,  131,163,195,227, 258, 0xffff ];

pub static MATCH_EXTRA : [u8; 29] = [ 0,0,0,0, 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4, 5,5,5,5, 0 ];

pub static DIST_OFF : [usize; 30] = [ 1,2,3,4, 5,7,9,13, 17,25,33,49, 65,97,129,193, 257,385,513,769, 
  1025,1537,2049,3073, 4097,6145,8193,12289, 16385,24577 ];

pub static DIST_EXTRA : [u8; 30] = [ 0,0,0,0, 1,1,2,2, 3,3,4,4, 5,5,6,6, 7,7,8,8, 9,9,10,10, 11,11,12,12, 13,13 ];

GitHub repository here

The other answers are right, but here is some code in Rust I wrote recently to make the ideas concrete. This is the key routine:

  fn decode( &self, input: &mut InpBitStream ) -> usize
  {
    let mut sym = self.lookup[ input.peek( self.peekbits ) ];
    if sym >= self.ncode
    {
      sym = self.lookup[ sym - self.ncode + ( input.peek( self.maxbits ) >> self.peekbits ) ];
    }  
    input.advance( self.nbits[ sym ] as usize );
    sym
  }
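
To show the idea behind that routine in isolation, here is a minimal single-level sketch. It is not taken from the decoder in this answer: the names `build_table` and `decode_all` and the 3-symbol code are assumptions made up for illustration, and it assumes every code fits in `peekbits` bits, so no secondary table is needed. The stream is held in an integer with the first bit in the lowest position, matching the LSB-first order used by `InpBitStream::peek`.

```rust
/// Reverse the low `bits` bits of `x`.
fn reverse(mut x: usize, bits: usize) -> usize {
    let mut r = 0;
    for _ in 0..bits {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    r
}

/// Build a table indexed by the next `peekbits` input bits (LSB-first).
/// `codes[sym] = (code, len)`; assumes `len <= peekbits` for every symbol.
fn build_table(codes: &[(usize, usize)], peekbits: usize) -> Vec<usize> {
    let mut table = vec![0; 1 << peekbits];
    for (sym, &(code, len)) in codes.iter().enumerate() {
        let diff = peekbits - len;
        // A code shorter than `peekbits` owns every slot that starts with it,
        // one slot per combination of the unused trailing bits.
        for i in (code << diff)..((code << diff) + (1 << diff)) {
            // Codes are defined MSB-first, the stream is read LSB-first.
            table[reverse(i, peekbits)] = sym;
        }
    }
    table
}

/// Decode `count` symbols from `bits`, an integer holding the stream LSB-first.
fn decode_all(
    mut bits: usize,
    count: usize,
    codes: &[(usize, usize)],
    peekbits: usize,
) -> Vec<usize> {
    let table = build_table(codes, peekbits);
    let mut out = Vec::with_capacity(count);
    for _ in 0..count {
        let sym = table[bits & ((1 << peekbits) - 1)]; // peek
        bits >>= codes[sym].1; // advance by the length of the matched code only
        out.push(sym);
    }
    out
}
```

With the hypothetical code 0 -> `0`, 1 -> `10`, 2 -> `11`, that is `codes = [(0b0, 1), (0b10, 2), (0b11, 2)]`, and `peekbits = 2`, the table comes out as `[0, 1, 0, 2]`: symbol 0 occupies two slots because its code leaves one bit unused. Calling `decode_all(0b011001, 4, &codes, 2)` on the bit sequence 1,0,0,1,1,0 yields `[1, 0, 2, 0]`. Each symbol costs one table read and one shift, whatever the shape of the tree.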

The tricky part is setting up the lookup table; see BitDecoder::setup_code in this complete RFC 1951 decoder in Rust:

// RFC 1951 inflate ( de-compress ).

pub fn inflate( data: &[u8] ) -> Vec<u8>
{
  let mut inp = InpBitStream::new( &data );
  let mut out = Vec::new();
  let _chk = inp.get_bits( 16 ); // Skip 16 bits: presumably the 2-byte zlib header ( RFC 1950 ), not a checksum; not validated here
  loop
  {
    let last = inp.get_bit();
    let btype = inp.get_bits( 2 );
    match btype
    {
      2 => { do_dyn( &mut inp, &mut out ); }
      1 => { do_fixed( &mut inp, &mut out ); }
      0 => { do_copy( &mut inp, &mut out ); }
      _ => { } // btype 3 is reserved by RFC 1951; it is silently ignored here rather than reported
    }
    if last != 0 { break; }
  }  
  out
}

fn do_dyn( inp: &mut InpBitStream, out: &mut Vec<u8> )
{
  let n_lit_code = 257 + inp.get_bits( 5 );
  let n_dist_code = 1 + inp.get_bits( 5 );
  let n_len_code = 4 + inp.get_bits( 4 );

  let mut len = LenDecoder::new( inp, n_len_code );

  let mut lit = BitDecoder::new( n_lit_code );
  len.get_lengths( inp, &mut lit.nbits );
  lit.init(); 

  let mut dist = BitDecoder::new( n_dist_code );
  len.get_lengths( inp, &mut dist.nbits );
  dist.init();

  loop
  {
    let x = lit.decode( inp );
    match x
    {
      0..=255 => { out.push( x as u8 ); }
      256 =>  { break; } 
      _ =>
      {
        let mc = x - 257;
        let length = MATCH_OFF[ mc ] + inp.get_bits( MATCH_EXTRA[ mc ] as usize );
        let dc = dist.decode( inp );
        let distance = DIST_OFF[ dc ] + inp.get_bits( DIST_EXTRA[ dc ] as usize );
        copy( out, distance, length ); 
      }
    }
  }
} // end do_dyn

fn copy( out: &mut Vec<u8>, distance: usize, mut length: usize )
{
  let mut i = out.len() - distance;
  while length > 0
  {
    out.push( out[ i ] );
    i += 1;
    length -= 1;
  }
}

/// Decode length-limited Huffman codes.
struct BitDecoder
{
  ncode: usize,
  nbits: Vec<u8>,
  maxbits: usize,
  peekbits: usize,
  lookup: Vec<usize>
}

impl BitDecoder
{
  fn new( ncode: usize ) -> BitDecoder
  {
    BitDecoder 
    { 
      ncode,
      nbits: vec![0; ncode],
      maxbits: 0,
      peekbits: 0,
      lookup: Vec::new()
    }
  }

  /// The key routine, will be called many times.
  fn decode( &self, input: &mut InpBitStream ) -> usize
  {
    let mut sym = self.lookup[ input.peek( self.peekbits ) ];
    if sym >= self.ncode
    {
      sym = self.lookup[ sym - self.ncode + ( input.peek( self.maxbits ) >> self.peekbits ) ];
    }  
    input.advance( self.nbits[ sym ] as usize );
    sym
  }

  fn init( &mut self )
  {
    let ncode = self.ncode;

    let mut max_bits : usize = 0; 
    for bp in &self.nbits 
    { 
      let bits = *bp as usize;
      if bits > max_bits { max_bits = bits; } 
    }

    self.maxbits = max_bits;
    self.peekbits = if max_bits > 8 { 8 } else { max_bits };
    self.lookup.resize( 1 << self.peekbits, 0 );

    // Code below is from rfc1951 page 7

    let mut bl_count : Vec<usize> = vec![ 0; max_bits + 1 ]; // the number of codes of length N, N >= 1.

    for i in 0..ncode { bl_count[ self.nbits[i] as usize ] += 1; }

    let mut next_code : Vec<usize> = vec![ 0; max_bits + 1 ];
    let mut code = 0; 
    bl_count[0] = 0;

    for i in 0..max_bits
    {
      code = ( code + bl_count[i] ) << 1;
      next_code[ i + 1 ] = code;
    }

    for i in 0..ncode
    {
      let len = self.nbits[ i ] as usize;
      if len != 0
      {
        self.setup_code( i, len, next_code[ len ] );
        next_code[ len ] += 1;
      }
    }
  }

  // Decoding is done using self.lookup ( see decode ). To keep the lookup table small,
  // codes longer than 8 bits are looked up in two peeks.

  fn setup_code( &mut self, sym: usize, len: usize, mut code: usize )
  {
    if len <= self.peekbits
    {
      let diff = self.peekbits - len;
      for i in code << diff .. (code << diff) + (1 << diff)
      {
        // bits are reversed to match InpBitStream::peek
        let r = reverse( i, self.peekbits );
        self.lookup[ r ] = sym;
      }
    } else {
      // Secondary lookup required.
      let peekbits2 = self.maxbits - self.peekbits;

      // Split code into peekbits portion ( key ) and remainder ( code).
      let diff1 = len - self.peekbits;
      let key = code >> diff1;
      code &= ( 1 << diff1 ) - 1;

      // Get the secondary lookup.
      let kr = reverse( key, self.peekbits );
      let mut base = self.lookup[ kr ];
      if base == 0 // Secondary lookup not yet allocated for this key.
      {
        base = self.lookup.len();
        self.lookup.resize( base + ( 1 << peekbits2 ), 0 );
        self.lookup[ kr ] = self.ncode + base;
      } else {
        base -= self.ncode;
      }

      // Set the secondary lookup values.
      let diff = self.maxbits - len;
      for i in code << diff .. (code << diff) + (1<<diff)
      { 
        let r = reverse( i, peekbits2 );
        self.lookup[ base + r ] = sym;
      }
    }    
  }
} // end impl BitDecoder

struct InpBitStream<'a>
{
  data: &'a [u8],
  pos: usize,
  buf: usize,
  got: usize, // Number of bits in buffer.
}

impl <'a> InpBitStream<'a>
{
  fn new( data: &'a [u8] ) -> InpBitStream
  {
    InpBitStream { data, pos: 0, buf: 0, got: 0 } // buf starts empty; a non-zero start value would be OR-ed into the first byte by peek
  } 

  fn peek( &mut self, n: usize ) -> usize
  {
    while self.got < n
    {
      if self.pos < self.data.len() 
      {
        self.buf |= ( self.data[ self.pos ] as usize ) << self.got;
      }
      self.pos += 1;
      self.got += 8;
    }
    self.buf & ( ( 1 << n ) - 1 )
  }

  fn advance( &mut self, n:usize )
  { 
    self.buf >>= n;
    self.got -= n;
  }

  fn get_bit( &mut self ) -> usize
  {
    if self.got == 0 { self.peek( 1 ); }
    let result = self.buf & 1;
    self.advance( 1 );
    result
  }

  fn get_bits( &mut self, n: usize ) -> usize
  { 
    let result = self.peek( n );
    self.advance( n );
    result
  }

  fn get_huff( &mut self, mut n: usize ) -> usize 
  { 
    let mut result = 0; 
    while n > 0
    { 
      result = ( result << 1 ) + self.get_bit(); 
      n -= 1;
    }
    result
  }

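  /// Discard bits up to the next byte boundary ( used before a stored block ).
  fn clear_bits( &mut self )
  {
    // Whole bytes already loaded by peek stay in the buffer; only the partial byte is dropped.
    let partial = self.got % 8;
    self.buf >>= partial;
    self.got -= partial;
  }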
} //  end impl InpBitStream

/// Decode code lengths.
struct LenDecoder
{
  plenc: u8, // previous length code ( which can be repeated )
  rep: usize,   // repeat
  bd: BitDecoder,
}

/// Decodes an array of lengths. There are special codes for repeats, and repeats of zeros.
impl LenDecoder
{
  fn new( inp: &mut InpBitStream, n_len_code: usize ) -> LenDecoder
  {
    let mut result = LenDecoder { plenc: 0, rep:0, bd: BitDecoder::new( 19 ) };

    // Read the array of 3-bit code lengths from input.
    for i in 0..n_len_code 
    { 
      result.bd.nbits[ CLEN_ALPHABET[i] as usize ] = inp.get_bits(3) as u8; 
    }
    result.bd.init();
    result
  }

  // Per RFC1951 page 13, get array of code lengths.
  fn get_lengths( &mut self, inp: &mut InpBitStream, result: &mut Vec<u8> )
  {
    let n = result.len();
    let mut i = 0;
    while self.rep > 0 { result[i] = self.plenc; i += 1; self.rep -= 1; }
    while i < n
    { 
      let lenc = self.bd.decode( inp ) as u8;
      if lenc < 16 
      {
        result[i] = lenc; 
        i += 1; 
        self.plenc = lenc; 
      } else {
        if lenc == 16 { self.rep = 3 + inp.get_bits(2); }
        else if lenc == 17 { self.rep = 3 + inp.get_bits(3); self.plenc=0; }
        else if lenc == 18 { self.rep = 11 + inp.get_bits(7); self.plenc=0; } 
        while i < n && self.rep > 0 { result[i] = self.plenc; i += 1; self.rep -= 1; }
      }
    }
  } // end get_lengths
} // end impl LenDecoder

/// Reverse a string of bits.
pub fn reverse( mut x:usize, mut bits: usize ) -> usize
{ 
  let mut result: usize = 0; 
  while bits > 0
  {
    result = ( result << 1 ) | ( x & 1 ); 
    x >>= 1; 
    bits -= 1;
  } 
  result
} 

fn do_copy( inp: &mut InpBitStream, out: &mut Vec<u8> )
{
  inp.clear_bits(); // Skip to the next byte boundary
  let mut n = inp.get_bits( 16 ); // LEN: number of bytes in the stored block
  let _n1 = inp.get_bits( 16 ); // NLEN: one's complement of LEN ( not checked here )
  while n > 0 { out.push( inp.data[ inp.pos ] ); n -= 1; inp.pos += 1; }
}

fn do_fixed( inp: &mut InpBitStream, out: &mut Vec<u8> ) // RFC1951 page 12.
{
  loop
  {
    // 0 to 23 ( 7 bits ) => 256 - 279; 48 - 191 ( 8 bits ) => 0 - 143; 
    // 192 - 199 ( 8 bits ) => 280 - 287; 400..511 ( 9 bits ) => 144 - 255
    let mut x = inp.get_huff( 7 ); 
    if x <= 23 
    { 
      x += 256; 
    } else {
      x = ( x << 1 ) + inp.get_bit();
      if x <= 191 { x -= 48; }
      else if x <= 199 { x += 88; }
      else { x = ( x << 1 ) + inp.get_bit() - 256; }
    }

    match x
    {
      0..=255 => { out.push( x as u8 ); }
      256 => { break; } 
      _ => // 257 <= x && x <= 285 
      { 
        x -= 257;
        let length = MATCH_OFF[x] + inp.get_bits( MATCH_EXTRA[ x ] as usize );
        let dcode = inp.get_huff( 5 );
        let distance = DIST_OFF[dcode] + inp.get_bits( DIST_EXTRA[dcode] as usize );
        copy( out, distance, length );
      }
    }
  }
} // end do_fixed

// RFC 1951 constants.

pub static CLEN_ALPHABET : [u8; 19] = [ 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 ];

pub static MATCH_OFF : [usize; 30] = [ 3,4,5,6, 7,8,9,10, 11,13,15,17, 19,23,27,31, 35,43,51,59, 
  67,83,99,115,  131,163,195,227, 258, 0xffff ];

pub static MATCH_EXTRA : [u8; 29] = [ 0,0,0,0, 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4, 5,5,5,5, 0 ];

pub static DIST_OFF : [usize; 30] = [ 1,2,3,4, 5,7,9,13, 17,25,33,49, 65,97,129,193, 257,385,513,769, 
  1025,1537,2049,3073, 4097,6145,8193,12289, 16385,24577 ];

pub static DIST_EXTRA : [u8; 30] = [ 0,0,0,0, 1,1,2,2, 3,3,4,4, 5,5,6,6, 7,7,8,8, 9,9,10,10, 11,11,12,12, 13,13 ];

GitHub repository here
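
The code-assignment step inside `BitDecoder::init` can be hard to follow when it is interleaved with table setup, so here is a standalone sketch of just that step, following the algorithm in RFC 1951 section 3.2.2. The function name `canonical_codes` is an assumption for illustration and does not appear in the decoder above; it returns the codes rather than filling a lookup table.

```rust
/// Assign canonical Huffman codes from code lengths (0 means "symbol unused").
/// Returns one code per symbol, to be read MSB-first over `lengths[sym]` bits.
fn canonical_codes(lengths: &[u8]) -> Vec<u32> {
    let max_bits = lengths.iter().copied().max().unwrap_or(0) as usize;

    // Step 1: count how many codes there are of each length.
    let mut bl_count = vec![0u32; max_bits + 1];
    for &len in lengths {
        bl_count[len as usize] += 1;
    }
    bl_count[0] = 0; // unused symbols do not take part

    // Step 2: find the numerically smallest code for each length.
    let mut next_code = vec![0u32; max_bits + 1];
    let mut code = 0u32;
    for bits in 1..=max_bits {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = code;
    }

    // Step 3: hand out consecutive codes in symbol order within each length.
    lengths
        .iter()
        .map(|&len| {
            if len == 0 {
                0 // placeholder; an unused symbol has no code
            } else {
                let assigned = next_code[len as usize];
                next_code[len as usize] += 1;
                assigned
            }
        })
        .collect()
}
```

For the example in RFC 1951, lengths `[3, 3, 3, 3, 3, 2, 4, 4]` give the codes 010, 011, 100, 101, 110, 00, 1110, 1111, so `canonical_codes` returns `[2, 3, 4, 5, 6, 0, 14, 15]`. Because the codes are fully determined by the lengths, the compressed stream only has to carry the lengths, which is what `LenDecoder` reads.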
