这种压缩算法有名字吗?
假设您有一个四字节整数,并且希望将其压缩为更少的字节。您可以压缩它,因为较小的值比较大的值更有可能(即,值的概率随着其大小而减小)。您应用以下方案来生成 1、2、3 或 4 字节结果:
请注意,在下面的描述中(这些位是基于 1 的,并且从最高有效位到最低有效位),即第一位指的是最有效位,第二位到下一个最高有效位,等等...)
- 如果 n<128,则将其编码为 第一位设置的单字节 为零
- 如果 n>=128 且 n<16,384 , 您使用一个两字节整数。你设定 第一位到一,表示 第二位为零。然后你 使用剩余的 14 位进行编码 数字 n。
- 如果 n>16,384 并且 n<2,097,152 ,您使用三字节 整数。您将第一位设置为 一,第二位到一,以及 第三位为零。您使用 剩余 21 位,用于编码 n。
- 如果n>2,097,152并且n<268,435,456, 您使用四字节整数。你设定 前三位到一和 第四位为零。您使用 剩余 28 位用于编码 n。
- 如果n≥268,435,456并且n<4,294,967,296, 您使用五字节整数。你设定 将前四位改为 1 并使用 以下32位来设置 n 的精确值,作为四字节 整数。其余位未被使用。
这个算法有名字吗?
Say you have a four byte integer and you want to compress it to fewer bytes. You are able to compress it because smaller values are more probable than larger values (i.e., the probability of a value decreases with its magnitude). You apply the following scheme, to produce a 1, 2, 3 or 4 byte result:
Note that in the description below (the bits are one-based and go from most significant to least significant), i.e., the first bit refers to most significant bit, the second bit to the next most significant bit, etc...)
- If n<128, you encode it as a
single byte with the first bit set
to zero - If n>=128 and n<16,384 ,
you use a two byte integer. You set
the first bit to one, to indicate
and the second bit to zero. Then you
use the remaining 14 bits to encode
the number n. - If n>16,384 and
n<2,097,152 , you use a three byte
integer. You set the first bit to
one, the second bit to one, and the
third bit to zero. You use the
remaining 21 bits, to encode n. - If n>2,097,152 and n<268,435,456 ,
you use a four byte integer. You set
the first three bits to one and the
fourth bit to zero. You use the
remaining 28 bits to encode n. - If n>=268,435,456 and n<4,294,967,296,
you use a five byte integer. You set
the first four bits to one and use
the following 32-bits to set the
exact value of n, as a four byte
integer. The remainder of the bits is unused.
Is there a name for this algorithm?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
这非常接近可变长度数量编码或base-128。后一个名称源于这样一个事实:编码中的每个 7 位单元都可以被视为一个 base-128 数字。
This is quite close to variable-length quantity encoding or base-128. The latter name stems from the fact that each 7-bit unit in your encoding can be considered a base-128 digit.
听起来与 Dlugosz 的可变长度整数编码非常相似
it sounds very similar to Dlugosz' Variable-Length Integer Encoding
您的方案类似于 UTF-8,它是用于 Unicode 文本数据的编码方案。
主要区别在于 UTF-8 流中的每个字节都指示它是前导字节还是尾随字节,因此可以从中间开始读取序列。对于您的方案,如果存储了一系列此类值,则丢失的前导字节将使文件的其余部分完全无法读取。并且读取这样的序列必须从头开始,而不是从任意位置开始。
Your scheme is similar to UTF-8, which is an encoding scheme used for Unicode text data.
The chief difference is that every byte in a UTF-8 stream indicates whether it is a lead or trailing byte, therefore a sequence can be read starting in the middle. With your scheme a missing lead byte will make the rest of the file completely unreadable if a series of such values are stored. And reading such a sequence must start at the beginning, rather than an arbitrary location.
霍夫曼编码是指使用更少的位来存储更多的常用数据,以换取使用更多的位来存储更少的数据通用数据。
Huffman coding refers to using fewer bits to store more common data in exchange for using more bits to store less common data.
字节对齐可变长度编码将值编码为可变数量的整个字节。
也许最常见的是 Varint,每个字节保留 1 位作为“连续位”。
特别是,给定每个编码字节具有单个“连续位”长度位(加上 7 个数据位)的任何格式,可以创建一个相关格式,其中相同位重新组织成稍微不同的格式。不同的安排:
在开头打包所有长度信令位(通常全部打包在第一个字节中),
并将数据位打包到第一个字节的其余部分(如果还有剩余空间)(以及后续字节的所有 8 位,如果有的话)。
有几个人使用了这种位的重新排列(有时与其他格式调整相结合)来加速解码,通常使用基于 SIMD 的算法,
例如:"基于SIMD的解码
发帖列表的数量”
开始处的长度信号位:
“英制 varint 将所有连续位放在编码值的开头”
Carl Mastrangelo:让我们制作一个 Varint
“Group Varint Encoding” 更进一步:
单字节前缀存储以下 4 个整数值的 2 位“长度”。
也称为 varint -GB。
流 VByte< /a> 更进一步,将所有“长度”存储在一个位置,并将所有数据字节存储在单独的位置。
Varint:每个字节上的长度信令位
使用每个字节的高位来指示“继续”或“停止”,其余位(序列中每个字节的 7 位)解释为普通位编码实际值的二进制文件:
这听起来像 Google 协议缓冲区。
压缩整数的相关方法
总之:这段代码将一个整数分为两部分:
一元代码中的第一部分指示需要多少位来读取值的其余部分,第二部分(以位为单位指示宽度)或多或少是纯二进制代码,用于对实际值进行编码。
这个特定的代码将一元代码与二进制代码“线程化”,但其他类似的代码首先打包完整的一元代码,然后再打包二进制代码,
例如Elias gamma编码。
我怀疑此代码是“开始/停止代码”家族之一
如以下所述:
Steven Pigeon — 开始/停止代码 — Procs。 2001 年数据压缩会议,IEEE 计算机协会出版社,2001 年。
Byte-aligned variable-length encodings encode values into variable numbers of whole bytes.
Perhaps the most common is Varint, with 1 bit per byte reserved as a "continuation bit".
In particular, given any format with a single "continuation bit" length-bit (plus 7 data bits) per encoded byte, it's possible to create a related format with the same bits re-organized into a slightly different arrangement:
pack all the length-signalling bits at the beginning (usually all packed in the first byte),
and pack the data bits into the rest of the first byte (if there's any room left) (and all 8 bits of the following bytes, if any).
Several people people have used such re-arrangement of bits (sometimes combined with other format tweaks) in order to speed up decoding, typically using SIMD-based algorithms,
such as: "SIMD-Based Decoding
of Posting Lists"
Length-signalling bits at start:
"Imperial varint places all of the continuation bits at the beginning of the encoded value"
Carl Mastrangelo: Let’s Make a Varint
"Group Varint Encoding" goes further:
a single byte prefix stores the 2-bit "lengths" for the following 4 integer values.
Also called varint-GB.
Stream VByte goes even further, storing all the "lengths" in one place, and all the data bytes in a separate location.
Varint: Length-signalling bit on each byte
Using the high bit of each byte to indicate "continue" or "stop", and the remaining bits (7 bits of each byte in the sequence) interpreted as plain binary that encodes the actual value:
This sounds like the "Base 128 Varint" as used in Google Protocol Buffers.
related ways of compressing integers
In summary: this code represents an integer in 2 parts:
A first part in a unary code that indicates how many bits will be needed to read in the rest of the value, and a second part (of the indicated width in bits) in more-or-less plain binary that encodes the actual value.
This particular code "threads" the unary code with the binary code, but other, similar codes pack the complete unary code first, and then the binary code afterwards,
such as Elias gamma coding.
I suspect this code is one of the family of "Start/Stop Codes"
as described in:
Steven Pigeon — Start/Stop Codes — Procs. Data Compression Conference 2001, IEEE Computer Society Press, 2001.