Google Protocol Buffers:ZigZag 编码
来自编码 - Protocol Buffers - Google 代码上的“签名类型”:
ZigZag 编码将有符号整数映射到无符号整数,因此绝对值较小(例如 -1)的数字也具有较小的 varint 编码值。它以通过正整数和负整数来回“之字形”的方式执行此操作,因此 -1 被编码为 1,1 被编码为 2,-2 被编码为 3,依此类推,正如您所见可以参见下表:
签名原始编码为 0 0 -1 1 1 2 -2 3 2147483647 4294967294 -2147483648 4294967295
换句话说,每个值 n 都使用编码
(n << 1) ^ (n >> 31)
对于sint32s,或者
(n << 1) ^ (n >> 63)
适用于 64 位版本。
(n << 1) ^ (n >> 31)
如何等于表中的内容?我知道这对于正数有用,但是对于-1 来说,它是如何工作的呢? -1 不是 1111 1111
,而 (n << 1)
不是 1111 1110
吗? (负数上的位移位在任何语言中都格式良好吗?)
尽管如此,使用公式并执行 (-1 << 1) ^ (-1 >> 31)
,假设32 位 int,我得到 1111 1111
,即 40 亿,而表认为我应该有 1。
From "Signed Types" on Encoding - Protocol Buffers - Google Code:
ZigZag encoding maps signed integers to unsigned integers so that numbers with a small absolute value (for instance, -1) have a small varint encoded value too. It does this in a way that "zig-zags" back and forth through the positive and negative integers, so that -1 is encoded as 1, 1 is encoded as 2, -2 is encoded as 3, and so on, as you can see in the following table:
Signed Original Encoded As 0 0 -1 1 1 2 -2 3 2147483647 4294967294 -2147483648 4294967295
In other words, each value n is encoded using
(n << 1) ^ (n >> 31)
for sint32s, or
(n << 1) ^ (n >> 63)
for the 64-bit version.
How does (n << 1) ^ (n >> 31)
equal whats in the table? I understand that would work for positives, but how does that work for say, -1? Wouldn't -1 be 1111 1111
, and (n << 1)
be 1111 1110
? (Is bit-shifting on negatives well formed in any language?)
Nonetheless, using the fomula and doing (-1 << 1) ^ (-1 >> 31)
, assuming a 32-bit int, I get 1111 1111
, which is 4 billion, whereas the table thinks I should have 1.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
将负符号整数右移复制符号位,这样
,
这可能更容易以二进制形式可视化(此处为 8 位):
Shifting a negative signed integer to the right copies the sign bit, so that
Then,
This might be easier to visualise in binary (8 bit here):
考虑之字形映射的另一种方式是,它是符号和幅度表示的轻微扭曲。
在 Zig Zag 映射中,映射的最低有效位 (lsb) 指示值的符号:如果为 0,则原始值是非负的,如果为 1,则原始值是负的。
非负值只需左移一位,为 lsb 中的符号位腾出空间。
对于负值,您可以对数字的绝对值(大小)执行相同的左移一位,并简单地让 lsb 指示符号。例如,-1 可以映射到 0x03 或 0b00000011,其中 lsb 表示它是负数,并且 1 的大小左移 1 位。
这个符号和幅度表示的丑陋之处在于“负零”,映射为 0x01 或 0b00000001。这种零的变体“用完了”我们的一个值,并改变了我们可以用一表示的整数范围。我们可能希望将负零映射到 -2^63 的特殊情况,以便我们可以表示 [-2^63, 2^63) 的完整 64b 2 的补码范围。这意味着我们使用了一种有价值的单字节编码来表示一个值,该值很少用于针对小数值优化的编码中,并且我们引入了一种特殊情况,这是很糟糕的。
这就是符号和幅度表示中锯齿形扭曲发生的地方。符号位仍在 lsb 中,但对于负数,我们从幅度中减去 1,而不是特殊大小写的负零。现在,-1 映射到 0x01,-2^63 也有非特殊情况表示(即 - 幅度 2^63 - 1,左移一位,设置 lsb / 符号位,所有位设置为 1) 。
因此,考虑 Zig Zag 编码的另一种方式是,它是一种更智能的符号和幅度表示:幅度左移一位,符号位存储在 lsb 中,并从负数的幅度中减去 1。
使用您发布的无条件按位运算符实现这些转换比显式测试符号、特殊情况操作负值(例如 - 取反并减 1,或按位不)、移动幅度,然后显式设置要快LSB 符号位。然而,它们实际上是等效的,并且这种更明确的符号和幅度系列步骤可能更容易理解我们正在做这些事情的内容和原因。
我会警告您,C / C++ 中的位移位有符号值是不可移植的,应该避免。左移负值具有未定义的行为,右移负值具有实现定义的行为。即使左移正整数也可能产生未定义的行为(例如,如果移入符号位,可能会导致陷阱或更糟糕的情况)。因此,一般来说,不要在 C / C++ 中对有符号类型进行位移位。 “就说不吧。”
首先转换为类型的无符号版本,以便根据标准获得安全、定义明确的结果。这确实意味着您不会进行负值的算术移位(即,将符号位向右拖动)——只有逻辑移位,因此您需要调整逻辑来解决这一问题。
以下是 C 语言中 2 的补码 64b 整数的锯齿形映射的安全且可移植的版本:
请注意 XOR 右侧项中符号位的算术否定。这会产生非负数 0 或负数全 1 —— 就像符号位从 MSB 到 LSB 的算术移位一样。然后,XOR 有效地“撤消”/“重做”负值的 2 的补码减 1(即 - 1 的补码或逻辑负),而无需任何条件逻辑或进一步的数学运算。
Another way to think about zig zag mapping is that it is a slight twist on a sign and magnitude representation.
In zig zag mapping, the least significant bit (lsb) of the mapping indicates the sign of the value: if it's 0, then the original value is non-negative, if it's 1, then the original value is negative.
Non-negative values are simply left shifted one bit to make room for the sign bit in the lsb.
For negative values, you could do the same one bit left shift for the absolute value (magnitude) of the number and simply have the lsb indicate the sign. For example, -1 could map to 0x03 or 0b00000011, where the lsb indicates that it is negative and the magnitude of 1 is left shifted by 1 bit.
The ugly thing about this sign and magnitude representation is "negative zero," mapped as 0x01 or 0b00000001. This variant of zero "uses up" one of our values and shifts the range of integers we can represent by one. We probably want to special case map negative zero to -2^63, so that we can represent the full 64b 2's complement range of [-2^63, 2^63). That means we've used one of our valuable single byte encodings to represent a value that will very, very, very rarely be used in an encoding optimized for small magnitude numbers and we've introduced a special case, which is bad.
This is where zig zag's twist on this sign and magnitude representation happens. The sign bit is still in the lsb, but for negative numbers, we subtract one from the magnitude rather than special casing negative zero. Now, -1 maps to 0x01 and -2^63 has a non-special case representation too (i.e. - magnitude 2^63 - 1, left shifted one bit, with lsb / sign bit set, which is all bits set to 1s).
So, another way to think about zig zag encoding is that it is a smarter sign and magnitude representation: the magnitude is left shifted one bit, the sign bit is stored in the lsb, and 1 is subtracted from the magnitude of negative numbers.
It is faster to implement these transformations using the unconditional bit-wise operators that you posted rather than explicitly testing the sign, special case manipulating negative values (e.g. - negate and subtract 1, or bitwise not), shifting the magnitude, and then explicitly setting the lsb sign bit. However, they are equivalent in effect and this more explicit sign and magnitude series of steps might be easier to understand what and why we are doing these things.
I will warn you though that bit shifting signed values in C / C++ is not portable and should be avoided. Left shifting a negative value has undefined behavior and right shifting a negative value has implementation defined behavior. Even left shifting a positive integer can have undefined behavior (e.g. - if you shift into the sign bit it might cause a trap or something worse). So, in general, don't bit shift signed types in C / C++. "Just say no."
Cast first to the unsigned version of the type to have safe, well-defined results according to the standards. This does mean that you then won't have arithmetic shift of negative values (i.e. - dragging the sign bit to the right) -- only logical shift, so you need to adjust the logic to account for that.
Here are the safe and portable versions of the zig zag mappings for 2's complement 64b integers in C:
Note the arithmetic negation of the sign bit in the right hand term of the XORs. That yields either 0 for non-negatives or all 1's for negatives -- just like arithmetic shift of the sign bit from msb to lsb would do. The XOR then effectively "undoes" / "redoes" the 2's complementation minus 1 (i.e. - 1's complementation or logical negation) for negative values without any conditional logic or further math.
让我在讨论中添加我的两分钱。正如其他答案所指出的,锯齿形编码可以被认为是符号幅度扭曲。这一事实可用于实现适用于任意大小整数的转换函数。
例如,我在我的 Python 项目中使用了以下代码:
尽管 Python 的
int
没有固定的位宽,但这些函数仍然有效;相反,它是动态扩展的。然而,这种方法在编译语言中可能效率低下,因为它需要条件分支。Let me add my two cents to the discussion. As other answers noted, the zig-zag encoding can be thought as a sign-magnitude twist. This fact can be used to implement conversion functions which work for arbitrary-sized integers.
For example, I use the following code in one on my Python projects:
These functions work despite Python's
int
has no fixed bitwidth; instead, it extends dynamically. However, this approach may be inefficient in compiled languages since it requires conditional branching.