当前位置：文江博客话题详情

这种压缩算法有名字吗？

发布于 2024-10-31 09:14:43 字数 628 浏览 1 评论 0原文

假设您有一个四字节整数，并且希望将其压缩为更少的字节。您可以压缩它，因为较小的值比较大的值更有可能（即，值的概率随着其大小而减小）。您应用以下方案来生成 1、2、3 或 4 字节结果：

请注意，在下面的描述中（这些位是基于 1 的，并且从最高有效位到最低有效位），即第一位指的是最有效位，第二位到下一个最高有效位，等等...）

如果 n<128，则将其编码为第一位设置的单字节为零
如果 n>=128 且 n<16,384 ，您使用一个两字节整数。你设定第一位到一，表示第二位为零。然后你使用剩余的 14 位进行编码数字 n。
如果 n>16,384 并且 n<2,097,152 ，您使用三字节整数。您将第一位设置为一，第二位到一，以及第三位为零。您使用剩余 21 位，用于编码 n。
如果n＞2,097,152并且n＜268,435,456，您使用四字节整数。你设定前三位到一和第四位为零。您使用剩余 28 位用于编码 n。
如果n≥268,435,456并且n＜4,294,967,296，您使用五字节整数。你设定将前四位改为 1 并使用以下32位来设置 n 的精确值，作为四字节整数。其余位未被使用。

这个算法有名字吗？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

菊凝晚露 2024-11-07 09:14:43

这非常接近可变长度数量编码或base-128。后一个名称源于这样一个事实：编码中的每个 7 位单元都可以被视为一个 base-128 数字。

回复收藏 0 原文

北音执念 2024-11-07 09:14:43

听起来与 Dlugosz 的可变长度整数编码非常相似

回复收藏 0 原文

北城挽邺 2024-11-07 09:14:43

您的方案类似于 UTF-8，它是用于 Unicode 文本数据的编码方案。

主要区别在于 UTF-8 流中的每个字节都指示它是前导字节还是尾随字节，因此可以从中间开始读取序列。对于您的方案，如果存储了一系列此类值，则丢失的前导字节将使文件的其余部分完全无法读取。并且读取这样的序列必须从头开始，而不是从任意位置开始。

回复收藏 0 原文

无名指的心愿 2024-11-07 09:14:43

霍夫曼编码是指使用更少的位来存储更多的常用数据，以换取使用更多的位来存储更少的数据通用数据。

回复收藏 0 原文

卸妝后依然美 2024-11-07 09:14:43

字节对齐可变长度编码将值编码为可变数量的整个字节。
也许最常见的是 Varint，每个字节保留 1 位作为“连续位”。

特别是，给定每个编码字节具有单个“连续位”长度位（加上 7 个数据位）的任何格式，可以创建一个相关格式，其中相同位重新组织成稍微不同的格式。不同的安排：
在开头打包所有长度信令位（通常全部打包在第一个字节中），
并将数据位打包到第一个字节的其余部分（如果还有剩余空间）（以及后续字节的所有 8 位，如果有的话）。

有几个人使用了这种位的重新排列（有时与其他格式调整相结合）来加速解码，通常使用基于 SIMD 的算法，
例如："基于SIMD的解码
发帖列表的数量”

开始处的长度信号位：

“英制 varint 将所有连续位放在编码值的开头”

Carl Mastrangelo：让我们制作一个 Varint

“Group Varint Encoding” 更进一步：
单字节前缀存储以下 4 个整数值的 2 位“长度”。
也称为 varint -GB。

流 VByte< /a> 更进一步，将所有“长度”存储在一个位置，并将所有数据字节存储在单独的位置。

Varint：每个字节上的长度信令位

使用每个字节的高位来指示“继续”或“停止”，其余位（序列中每个字节的 7 位）解释为普通位编码实际值的二进制文件：

这听起来像 Google 协议缓冲区。

压缩整数的相关方法

总之：这段代码将一个整数分为两部分：
一元代码中的第一部分指示需要多少位来读取值的其余部分，第二部分（以位为单位指示宽度）或多或少是纯二进制代码，用于对实际值进行编码。

这个特定的代码将一元代码与二进制代码“线程化”，但其他类似的代码首先打包完整的一元代码，然后再打包二进制代码，
例如Elias gamma编码。

我怀疑此代码是“开始/停止代码”家族之一
如以下所述：

Steven Pigeon — 开始/停止代码 — Procs。 2001 年数据压缩会议，IEEE 计算机协会出版社，2001 年。

Byte-aligned variable-length encodings encode values into variable numbers of whole bytes.
Perhaps the most common is Varint, with 1 bit per byte reserved as a "continuation bit".

In particular, given any format with a single "continuation bit" length-bit (plus 7 data bits) per encoded byte, it's possible to create a related format with the same bits re-organized into a slightly different arrangement:
pack all the length-signalling bits at the beginning (usually all packed in the first byte),
and pack the data bits into the rest of the first byte (if there's any room left) (and all 8 bits of the following bytes, if any).

Several people people have used such re-arrangement of bits (sometimes combined with other format tweaks) in order to speed up decoding, typically using SIMD-based algorithms,
such as: "SIMD-Based Decoding
of Posting Lists"

Length-signalling bits at start:

"Imperial varint places all of the continuation bits at the beginning of the encoded value"

Carl Mastrangelo: Let’s Make a Varint

"Group Varint Encoding" goes further:
a single byte prefix stores the 2-bit "lengths" for the following 4 integer values.
Also called varint-GB.

Stream VByte goes even further, storing all the "lengths" in one place, and all the data bytes in a separate location.

Varint: Length-signalling bit on each byte

Using the high bit of each byte to indicate "continue" or "stop", and the remaining bits (7 bits of each byte in the sequence) interpreted as plain binary that encodes the actual value:

This sounds like the "Base 128 Varint" as used in Google Protocol Buffers.

related ways of compressing integers

In summary: this code represents an integer in 2 parts:
A first part in a unary code that indicates how many bits will be needed to read in the rest of the value, and a second part (of the indicated width in bits) in more-or-less plain binary that encodes the actual value.

This particular code "threads" the unary code with the binary code, but other, similar codes pack the complete unary code first, and then the binary code afterwards,
such as Elias gamma coding.

I suspect this code is one of the family of "Start/Stop Codes"
as described in:

Steven Pigeon — Start/Stop Codes — Procs. Data Compression Conference 2001, IEEE Computer Society Press, 2001.

回复收藏 0 原文

~没有更多了~