Ruby Base64 encode/decode/unpack('m') troubles
Having a strange ruby encoding encounter:
ruby-1.9.2-p180 :618 > s = "a8dnsjg8aiw8jq".ljust(16,'=')
=> "a8dnsjg8aiw8jq=="
ruby-1.9.2-p180 :619 > s.size
=> 16
ruby-1.9.2-p180 :620 > s.unpack('m0')
ArgumentError: invalid base64
from (irb):631:in `unpack'
ruby-1.9.2-p180 :621 > s.unpack('m')
=> ["k\xC7g\xB28<j,<\x8E"]
ruby-1.9.2-p180 :622 > s.unpack('m').first.size
=> 10
ruby-1.9.2-p180 :623 > s.unpack('m').pack('m')
=> "a8dnsjg8aiw8jg==\n"
ruby-1.9.2-p180 :624 > s.unpack('m').pack('m') == s
=> false
Any idea why this is not symmetric? And why does 'm0' (the format behind Base64.strict_decode64) fail outright? The input string is padded out to a multiple of 4 characters in the Base64 alphabet. Here it is 14 x 6 bits = 84 bits, which is 10 1/2 8-bit bytes, i.e. 11 bytes when rounded up. But the decoded string seems to drop the last nybble?
Am I missing something obvious or is this a bug? Workaround?
cf. http://www.ietf.org/rfc/rfc4648.txt
There is no symmetry because Base64 is not a one-to-one mapping for padded strings. Let's start from the actual decoded content. If you view your decoded string in hex (using e.g. s.unpack('m').first.unpack('H*')), it will be this:
    6B C7 67 | B2 38 3C | 6A 2C 3C | 8E
I added the boundaries for each input block to the Base64 algorithm: it takes 3 octets of input and returns 4 characters of output. So our last block contains only one input octet, thus the result will be 4 characters that end in "==" according to the standard.
Let's see what the canonical encoding of that last block would be. In binary representation 8E is 10001110. The RFC tells us to pad the missing bits with zeroes until we reach the required 24 bits:
    100011 100000 000000 000000
I made groups of 6 bits, because that's what we need to get the corresponding characters from the Base64 alphabet. The first group (100011) translates to 35 decimal and thus is a j in the Base64 alphabet. The second (100000) is 32 decimal and hence a g. The two remaining characters are to be padded as "==" according to the rules. So the canonical encoding is
    jg==
If you look at jq== now, in binary this will be
    100011 101010 000000 000000
So the difference is in the second group. But since we already know that only the first 8 bits are of interest to us (the "==" tells us so: we will only retrieve one decoded octet from these four characters), we actually only care about the first two bits of the second group, because the 6 bits of group 1 and the first 2 bits of group 2 form our decoded octet. 100011 10 together form again our initial 8E byte value. The remaining 16 bits are irrelevant to us and can thus be discarded.
This also implies why the notion of "strict" Base64 decoding makes sense: non-strict decoding will discard any garbage at the end, whereas strict decoding will check that the leftover bits in the final 6-bit group are zero. That's why your non-canonical encoding is rejected by strict decoding rules.
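The padding arithmetic above can be checked by hand. The following is a minimal sketch, not part of the original answer; the constant ALPHABET and the helper encode_single_octet are hypothetical names introduced only for illustration, and the helper handles nothing but the one-octet tail case discussed here.

```ruby
# Base64 alphabet as listed in RFC 4648, section 4 (index 0..63).
ALPHABET = [*('A'..'Z'), *('a'..'z'), *('0'..'9'), '+', '/'].freeze

# Hypothetical helper: encode a single trailing octet the way the
# answer describes, i.e. 8 data bits followed by zero pad bits.
def encode_single_octet(byte)
  bits   = byte << 4                # 8 data bits + 4 zero pad bits = 12 bits
  first  = (bits >> 6) & 0b111111   # first 6-bit group
  second = bits & 0b111111          # second group: 2 data bits + 4 zeros
  "#{ALPHABET[first]}#{ALPHABET[second]}=="
end

puts encode_single_octet(0x8E)                   # hand-rolled encoding of 8E
puts [0x8E.chr].pack('m0')                       # Ruby's own encoder, for comparison
puts ALPHABET.index('g').to_s(2).rjust(6, '0')   # bit pattern of the canonical 2nd char
puts ALPHABET.index('q').to_s(2).rjust(6, '0')   # bit pattern of the 2nd char in jq==
```

Both encoders should agree on the canonical form, and the last two lines show that g and q differ only in the low four bits, which are exactly the bits a decoder discards.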
The RFC you've linked says plainly that a final quad of the form xx== corresponds to one octet of the input sequence. You cannot make 16 bits of information (two arbitrary octets) out of 12, so rounding up is invalid here.
Your string is rejected in strict mode because jq== cannot appear as the result of a correct Base64 encoding process. An input sequence whose length is not a multiple of 3 is zero-padded, and your string has non-zero bits where they cannot appear:
    j = 100011, q = 101010
    data bits:    100011 10   (the octet 8E)
    padding bits: 1010        (must be 0000)
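To make the "non-zero bits where they cannot appear" point concrete, here is a small sketch that extracts the bits a decoder throws away from the last quad. The names B64_CHARS and discarded_pad_bits are hypothetical, and the helper assumes a well-formed, padded Base64 string of at least four characters.

```ruby
# Base64 alphabet as listed in RFC 4648, section 4 (index 0..63).
B64_CHARS = [*('A'..'Z'), *('a'..'z'), *('0'..'9'), '+', '/'].freeze

# Hypothetical helper: return the pad bits that decoding discards from
# the final quad. A result of 0 means the encoding is canonical.
def discarded_pad_bits(encoded)
  quad = encoded[-4, 4]
  case quad.count('=')
  when 2 then B64_CHARS.index(quad[1]) & 0b1111  # xx==: low 4 bits of 2nd char
  when 1 then B64_CHARS.index(quad[2]) & 0b11    # xxx=: low 2 bits of 3rd char
  else 0                                         # no padding, nothing discarded
  end
end

# The string from the question carries stray bits; the canonical one does not.
puts discarded_pad_bits('a8dnsjg8aiw8jq==').to_s(2).rjust(4, '0')
puts discarded_pad_bits('a8dnsjg8aiw8jg==').to_s(2).rjust(4, '0')
```

A strict decoder is free to reject any input for which this value is non-zero.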
Section 3.5, "Canonical Encoding", of RFC 4648 (https://www.rfc-editor.org/rfc/rfc4648) covers this case: when the input is only one octet, all six bits of the first symbol are used but only the first two bits of the next symbol, and the remaining pad bits MUST be set to zero by conforming encoders. It also says that decoders MAY choose to reject an encoding if the pad bits have not been set to zero.
Your last four bytes (jq==) decode to these binary values:
    j      q
    100011 101010
    ______ __****
The underlined bits are used to form the last decoded byte (hex 8E). The remaining bits (with asterisks under them) are supposed to be zero (which would be encoded jg==, not jq==).
The m unpacking is being forgiving about the padding bits that should be zero but are not. The m0 unpacking is not so forgiving, as it is allowed to be (see the "MAY" in the RFC passage above). Packing the unpacked result is not symmetric because your encoded value is non-canonical, but the pack method produces a canonical encoding (pad bits equal to zero).
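The difference between the two directives can be observed directly. This is a minimal sketch using only String#unpack and Array#pack; the helper name strict_decodable? is hypothetical and simply wraps the ArgumentError that 'm0' raises.

```ruby
# Hypothetical helper: true if the string passes strict ('m0') decoding.
def strict_decodable?(encoded)
  encoded.unpack('m0')
  true
rescue ArgumentError
  false
end

s       = 'a8dnsjg8aiw8jq=='
decoded = s.unpack('m').first          # lenient decode ignores the stray pad bits

puts strict_decodable?(s)                      # the non-canonical input from the question
puts strict_decodable?([decoded].pack('m0'))   # the canonical re-encoding of the same bytes
puts [decoded].pack('m') == s                  # round trip compared with the original string
puts decoded.bytesize                          # number of decoded bytes
```

Note that pack('m') also appends a newline, so even a canonical input would not round-trip byte for byte with 'm'; pack('m0') omits it.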
Thanks for the good explanations on b64. I've upvoted you all and accepted @emboss's response.
However, this is not the answer I was looking for. To better state the question, it would be,
From your explanations I now see that this will work for our purposes:
The only problem then being that the decoded string length is not preserved, but we can work around that.
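One possible workaround along these lines, sketched under the assumption that it is acceptable to normalize the input before handing it to a strict decoder: decode leniently, then re-encode canonically. The helper name canonicalize_base64 is hypothetical, and this is not necessarily the approach the poster settled on. Keep in mind that lenient 'm' decoding also silently skips characters outside the alphabet, so this sketch does not validate its input.

```ruby
# Hypothetical helper: turn a possibly non-canonical Base64 string into
# its canonical form so that strict ('m0') decoding accepts it.
def canonicalize_base64(encoded)
  raw = encoded.unpack('m').first   # lenient decode drops the stray pad bits
  [raw].pack('m0')                  # canonical encode, no trailing newline
end

fixed = canonicalize_base64('a8dnsjg8aiw8jq==')
puts fixed                                  # canonical form of the question's string
puts fixed.unpack('m0').first.bytesize      # strict decoding now succeeds
```

The decoded result is still 10 bytes rather than 11, which matches the length caveat mentioned above: the four trailing bits carry no full octet and are lost.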