I don't get Golomb/Rice coding: it makes the input take more bits, doesn't it?
Or, maybe, what I don't get is unary coding.

In Golomb (or Rice) coding, you split a number N into two parts by dividing it by another number M, then code the integer result of that division in unary and the remainder in binary.

In the Wikipedia example, they use 42 as N and 10 as M, so we end up with a quotient q of 4 (in unary: 11110) and a remainder r of 2 (in binary: 010), so that the resulting message is 11110,010, or 8 bits (the comma can be skipped). The simple binary representation of 42 is 101010, or 6 bits.

To me, this seems due to the unary representation of q, which always has to be more bits than binary.

Clearly, I'm missing some important point here. What is it?
Comments (2)
The important point is that Golomb codes are not meant to be shorter than the shortest binary encoding for one particular number. Rather, by providing a specific kind of variable-length encoding, they reduce the average length per encoded value compared to fixed-width encoding, if the encoded values are from a large range, but the most common values are generally small (and hence are using only a small fraction of that range most of the time).
As an example, if you were to transmit integers in the range from 0 to 1000, but a large majority of the actual values were in the range between 0 and 10, in a fixed-width encoding, most of the transmitted codes would have leading 0s that contain no information:
To cover all values between 0 and 1000, you need a 10-bit wide encoding in fixed-width binary. Now, as most of your values would be below 10, at least the first 6 bits of most numbers would be 0 and would carry little information.
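To make the wasted leading zeros concrete, here is a small Python sketch. The 10-bit width follows from the 0..1000 range above; the helper name `fixed_width` and the sample values are only picked for illustration.

```python
# Fixed-width binary for the 0..1000 range discussed above.
WIDTH = (1000).bit_length()  # number of bits needed to represent 1000


def fixed_width(n: int, width: int = WIDTH) -> str:
    """Return n as a zero-padded binary string of the given width."""
    return format(n, f"0{width}b")


for n in (3, 7, 9, 1000):  # sample values, chosen only for illustration
    code = fixed_width(n)
    leading_zeros = len(code) - len(code.lstrip("0"))
    print(n, code, "leading zeros:", leading_zeros)
```

Every value below 10 prints with at least six leading zeros, which is the part of the code that carries little information.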
To rectify this with Golomb codes, you split the numbers by dividing them by 10 and encoding the quotient and the remainder separately. For most values, all that would have to be transmitted is the remainder, which can be encoded using 4 bits at most (less if you use truncated binary for the remainder). The quotient is then transmitted in unary, which encodes as a single 0 bit for all values below 10, as 10 for 10..19, 110 for 20..29, etc.

Now, for most of your values, you have reduced the message size to 5 bits max, but you are still able to transmit all values unambiguously without separators.
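The scheme described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names are made up for this example, bits are kept as text strings for readability, and the remainder uses truncated binary as mentioned in the parenthesis above.

```python
def unary(q: int) -> str:
    """q one-bits followed by a terminating zero-bit."""
    return "1" * q + "0"


def truncated_binary(r: int, m: int) -> str:
    """Encode r in [0, m) with b-1 or b bits, where b = ceil(log2(m))."""
    if m == 1:
        return ""                      # only one possible remainder, nothing to send
    b = (m - 1).bit_length()           # ceil(log2(m)) for m >= 2
    cutoff = (1 << b) - m              # this many remainders get the short form
    if r < cutoff:
        return format(r, f"0{b - 1}b")
    return format(r + cutoff, f"0{b}b")


def golomb_encode(n: int, m: int) -> str:
    """Golomb code of a non-negative integer n with divisor m."""
    q, r = divmod(n, m)
    return unary(q) + truncated_binary(r, m)


for n in (3, 7, 15, 42):
    print(n, golomb_encode(n, 10))
```

With m = 10, the value 42 comes out as 11110 followed by 010 (8 bits), while values below 10 take 4 or 5 bits.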
This comes at a rather high cost for the larger values (for example, values in the range 990..999 need 100 bits for the quotient), which is why the coding is optimal only when large values are rare, as with 2-sided geometric distributions.
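To see how the cost grows, the code length can be computed directly without building the bit string. The sketch below assumes truncated binary for the remainder and compares against the 10-bit fixed width from the example; `golomb_length` is a name invented here.

```python
def golomb_length(n: int, m: int) -> int:
    """Number of bits in the Golomb code of n with divisor m."""
    q, r = divmod(n, m)
    b = (m - 1).bit_length()            # ceil(log2(m)) for m >= 2
    cutoff = (1 << b) - m               # remainders below this use b-1 bits
    rem_bits = 0 if m == 1 else (b - 1 if r < cutoff else b)
    return (q + 1) + rem_bits           # unary part is q ones plus one zero


FIXED = (1000).bit_length()             # fixed-width size for 0..1000

for n in (5, 42, 500, 999):
    print(n, "golomb:", golomb_length(n, 10), "fixed:", FIXED)
```

Small values come out well under 10 bits, while 999 needs 100 bits for the quotient plus 4 for the remainder.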
The long runs of 1 bits in the quotients of larger values can be addressed with subsequent run-length encoding. However, if the quotients consume too much space in the resulting message, this could indicate that other codes might be more appropriate than Golomb/Rice.
One difference between the Golomb coding and binary code is that binary code is not a prefix code, which is a no-go for coding strings of arbitrarily large numbers (you cannot decide if 1010101010101010 is a concatenation of 10101010 and 10101010 or something else). Hence, they are not that easily comparable.
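The prefix property can be illustrated with a small round trip: several Golomb codes are concatenated with no separators and then split apart again. This is only a sketch under the same assumptions as the Wikipedia example (unary quotient, truncated binary remainder); the function names are hypothetical.

```python
def golomb_encode(n: int, m: int) -> str:
    """Unary quotient followed by a truncated-binary remainder."""
    q, r = divmod(n, m)
    code = "1" * q + "0"
    if m > 1:
        b = (m - 1).bit_length()
        cutoff = (1 << b) - m
        code += format(r, f"0{b - 1}b") if r < cutoff else format(r + cutoff, f"0{b}b")
    return code


def golomb_decode_stream(bits: str, m: int) -> list:
    """Split a concatenation of Golomb codes back into integers."""
    b = (m - 1).bit_length()
    cutoff = (1 << b) - m
    out, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":           # count the unary one-bits
            q += 1
            i += 1
        i += 1                          # skip the terminating zero-bit
        r = 0
        if m > 1:
            r = int(bits[i:i + b - 1] or "0", 2)   # read the short form first
            i += b - 1
            if r >= cutoff:             # long form: one more bit belongs to r
                r = ((r << 1) | int(bits[i])) - cutoff
                i += 1
        out.append(q * m + r)
    return out


values = [42, 7, 0, 15, 999]
stream = "".join(golomb_encode(v, 10) for v in values)
print(golomb_decode_stream(stream, 10) == values)
```

The decoder never needs a delimiter, because the terminating zero of the unary part and the known remainder width tell it exactly where each code ends. Plain binary without a fixed width offers no such boundary.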
Second, the Golomb code is optimal for a geometric distribution, in this case with parameter 2^(-1/10). The probability of 42 is some 0.3 %, so you get an idea of how important this is for the length of the output string.
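The figures above can be checked numerically. The sketch below assumes the geometric distribution is parameterised as P(n) = (1 - p) * p^n with p = 2^(-1/10), and that the remainder is coded in truncated binary; the cutoff `N` used to truncate the infinite sums is an arbitrary choice made here.

```python
import math

p = 2 ** (-1 / 10)                      # parameter quoted in the answer
M = 10                                  # divisor from the question's example
N = 5000                                # truncation point; the tail is negligible


def prob(n: int) -> float:
    """Assumed parameterisation: P(n) = (1 - p) * p**n for n = 0, 1, 2, ..."""
    return (1 - p) * p ** n


def golomb_length(n: int, m: int) -> int:
    """Bits used by the Golomb code of n (truncated-binary remainder)."""
    q, r = divmod(n, m)
    b = (m - 1).bit_length()
    cutoff = (1 << b) - m
    return (q + 1) + (0 if m == 1 else (b - 1 if r < cutoff else b))


p42 = prob(42)
expected_bits = sum(prob(n) * golomb_length(n, M) for n in range(N))
entropy_bits = -sum(prob(n) * math.log2(prob(n)) for n in range(N))

print(f"P(42) = {p42:.4%}")
print(f"expected Golomb length = {expected_bits:.3f} bits")
print(f"entropy of the source  = {entropy_bits:.3f} bits")
```

Under these assumptions the expected code length lands only slightly above the entropy of the source, even though the rare value 42 individually costs 8 bits.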