Compression algorithm for a small dictionary
I'm looking for a compression algorithm that works with symbols smaller than a byte. I did some quick research on compression algorithms, but it's hard to find out what symbol sizes they use. In any case, there are streams whose symbols are smaller than 8 bits. Does DEFLATE have a parameter to define the size of its symbols?
2 Answers
plaintext symbols smaller than a byte
The original descriptions of LZ77 and LZ78 describe them in terms of a sequence of decimal digits (symbols that are approximately half the size of a byte).
If you google for "DNA compression algorithm", you can get a bunch of information on algorithms specialized for compressing files that are almost entirely composed of the 4 letters A, G, C, T: a dictionary of 4 symbols, each only about a quarter the size of a byte.
Perhaps one of those algorithms might work for you with relatively little tweaking.
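To make that roughly-2-bits-per-symbol figure concrete, here is a minimal sketch (my own toy code, not taken from any published DNA compressor) that packs the letters A, C, G, T into 2 bits each, four bases per byte:

```python
# Toy 2-bit packing of DNA bases; real DNA compressors do much more than this.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_bases(seq: str) -> bytes:
    """Pack a string of A/C/G/T into 2 bits per base (4 bases per byte).

    For simplicity the final partial byte, if any, is not padded or
    length-tagged, so a real implementation would also store len(seq).
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODES[base]
        out.append(byte)
    return bytes(out)

print(len(pack_bases("ACGT" * 100)))  # 400 bases fit in 100 bytes
```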
The LZ77-style compression used in LZMA may appear to use two bytes per symbol for the first few symbols that it compresses.
But after compressing a few hundred plaintext symbols
(the letters of natural-language text, or sequences of decimal digits, or sequences of the 4 letters that represent DNA bases, etc.), the two-byte compressed "chunks" that LZMA puts out often represent a dozen or more plaintext symbols.
(I suspect the same is true for all similar algorithms, such as the LZ77 algorithm used in DEFLATE).
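To see that effect for yourself, a rough experiment (the exact numbers depend on the LZMA settings, the container overhead, and the input) is to compress a tiny prefix and the whole of the same repetitive text and compare the output size per plaintext symbol:

```python
import lzma

# A deliberately repetitive plaintext: 200 copies of one short sentence.
text = ("the quick brown fox jumps over the lazy dog. " * 200).encode("ascii")

for n in (20, len(text)):
    out = lzma.compress(text[:n])
    # For the tiny prefix, fixed .xz container overhead dominates the output;
    # for the full text, the average drops far below one byte per symbol.
    print(f"{n:5d} plaintext bytes -> {len(out):4d} compressed bytes "
          f"({len(out) / n:.3f} bytes per symbol)")
```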
If your files use only a restricted alphabet of far fewer than all 256 possible byte values,
in principle a programmer could adapt a variant of DEFLATE (or some other algorithm) that makes use of information about that alphabet to produce compressed files a few bits smaller than the same files compressed with standard DEFLATE.
However, many byte-oriented text compression algorithms (LZ77, LZW, LZMA, DEFLATE, etc.) build a dictionary of common long strings, and may give compression performance (with a sufficiently large source file) within a few percent of that custom-adapted variant -- the advantages of using a standard compressed file format are often worth sacrificing a few percent of potential space savings.
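As a very crude stand-in for such an alphabet-aware variant, you can pre-pack a restricted alphabet into fewer bits per symbol and then run standard DEFLATE (zlib) on the result; the gap between the two printed sizes depends heavily on how repetitive the real data is, and this toy input is just random bases:

```python
import random
import zlib

random.seed(0)
dna = "".join(random.choice("ACGT") for _ in range(100_000))

# Pack 4 bases into each byte (2 bits per base), same idea as the sketch above.
codes = {"A": 0, "C": 1, "G": 2, "T": 3}
packed = bytes(
    (codes[a] << 6) | (codes[b] << 4) | (codes[c] << 2) | codes[d]
    for a, b, c, d in zip(dna[0::4], dna[1::4], dna[2::4], dna[3::4])
)

print("DEFLATE on raw letters :", len(zlib.compress(dna.encode("ascii"), 9)), "bytes")
print("DEFLATE on packed bases:", len(zlib.compress(packed, 9)), "bytes")
```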
compressed symbols smaller than a byte
Many compression algorithms, including some that give the best known compression on benchmark files, output compressed information bit-by-bit (such as most of the PAQ series of compressors, and some kinds of arithmetic coders), while others output variable-length compressed information without regard for byte boundaries (such as Huffman compression).
Some ways of describing arithmetic coding talk about pieces of information, such as individual bits or pixels, that are compressed to "less than one bit of information".
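That "less than one bit" phrasing is just the Shannon information content, -log2(p), of a symbol whose probability p is close to 1; a quick check:

```python
import math

# A bit or pixel that takes its most likely value, say, 95% of the time
# carries only -log2(0.95) ~= 0.074 bits of information, which is roughly
# what an arithmetic coder can spend on it on average.
for p in (0.5, 0.9, 0.95, 0.99):
    print(f"p = {p:4}: {-math.log2(p):.3f} bits")
```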
EDIT:
The "counting argument" explains why it's not possible to compress all possible bytes, much less all possible bytes and a few common sequences of bytes, into codewords that are all less than 8 bits long.
Nevertheless, several compression algorithms can, and often do, represent some bytes or (more rarely) some sequences of bytes with codewords that are less than 8 bits long, by "sacrificing" or "escaping" less-common bytes, which end up represented by other codewords that (including the "escape") are more than 8 bits long.
Such algorithms include:
The Pike algorithm uses the 4 bits "0101" to represent 'e' (or, in some contexts, 'E'), the 8 bits "0000 0001" to represent the word " the" (4 bytes, including the leading space; or, in some contexts, " The" or " THE"), and so on.
It has a small dictionary of about 200 of the most-frequent English words,
including a sub-dictionary of 16 extremely common English words.
When compressing English text with byte-oriented Huffman coding, the sequence "e " (e space) is compressed to two codewords with a total of typically 6 bits.
Alas, when Huffman coding is involved, I can't tell you the exact size of those "small" codewords, or even tell you exactly what byte or byte sequence a small codeword represents, because it is different for every file.
Often the same codeword represents a different byte (or different byte sequence) at different locations in the same file.
The decoder decides which byte or byte sequence a codeword represents based on clues left behind by the compressor in the headers, and on the data decompressed so far.
With range coding or arithmetic coding, the "codeword" may not even be an integer number of bits.
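As a small, concrete illustration of the Huffman case above, here is a toy sketch (codeword lengths only, no actual bitstream) that builds a byte-oriented Huffman code from the symbol counts of one particular text; run it on a different text and the lengths change, which is exactly the point:

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict:
    """Return {byte value: Huffman codeword length in bits} for this input."""
    counts = Counter(data)
    # Heap entries: (subtree weight, tiebreaker, {byte: depth so far}).
    heap = [(weight, i, {byte: 0}) for i, (byte, weight) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every codeword inside them.
        merged = {b: depth + 1 for b, depth in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

sample = b"the rain in spain stays mainly in the plain. " * 50
lengths = huffman_code_lengths(sample)
print("codeword length for 'e':", lengths[ord("e")], "bits")
print("codeword length for ' ':", lengths[ord(" ")], "bits")
print("longest codeword       :", max(lengths.values()), "bits")
```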
You may want to look into a Golomb code. A Golomb code uses a divide-and-conquer approach to compress the input. It's not dictionary compression, but it's worth mentioning.
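For reference, here is a minimal sketch of a Rice code (the special case of a Golomb code whose parameter M is a power of two): each value is split into a quotient, written in unary, and a fixed-width remainder, so small values get codewords shorter than a byte:

```python
def rice_encode(n: int, k: int) -> str:
    """Rice code for a non-negative integer n with parameter k (M = 2**k):
    the quotient n >> k in unary (that many 1s and a terminating 0),
    followed by the remainder in exactly k bits."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

for n in range(6):
    print(n, "->", rice_encode(n, k=2))
# 0 -> 000, 1 -> 001, 2 -> 010, 3 -> 011, 4 -> 1000, 5 -> 1001
```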