Ruby base64 encode/decode/unpack('m') troubles

Posted on 2024-11-29 14:29:04

Having a strange ruby encoding encounter:

ruby-1.9.2-p180 :618 > s = "a8dnsjg8aiw8jq".ljust(16,'=')
 => "a8dnsjg8aiw8jq==" 
ruby-1.9.2-p180 :619 > s.size
 => 16 

ruby-1.9.2-p180 :620 > s.unpack('m0')
ArgumentError: invalid base64
    from (irb):631:in `unpack'

ruby-1.9.2-p180 :621 > s.unpack('m')
 => ["k\xC7g\xB28<j,<\x8E"] 
ruby-1.9.2-p180 :622 > s.unpack('m').first.size
 => 10

ruby-1.9.2-p180 :623 > s.unpack('m').pack('m')
 => "a8dnsjg8aiw8jg==\n" 
ruby-1.9.2-p180 :624 > s.unpack('m').pack('m') == s
 => false

Any idea why this is not symmetric? And why does 'm0' (the strict mode used by Base64.strict_decode64) not work at all? The input string is padded out to a multiple of 4 characters from the Base64 alphabet. Here it's 14 x 6 bits = 84 bits, which is 10 1/2 8-bit bytes, i.e. 11 bytes. But the decoded string seems to drop the last nybble?

Am I missing something obvious or is this a bug? Workaround?
cf. http://www.ietf.org/rfc/rfc4648.txt

玩套路吗 2024-12-06 14:29:04

There is no symmetry because Base64 decoding is not a one-to-one mapping for padded strings: several encoded strings can decode to the same bytes. Let's start from the actual decoded content. If you view your decoded string in hex (e.g. with s.unpack('m').first.unpack('H*')), it will be this:

6B C7 67 | B2 38 3C | 6A 2C 3C | 8E

I added the boundaries of each input block as the Base64 algorithm sees them: it takes 3 octets of input and returns 4 characters of output. Our last block contains only one input octet, so according to the standard the result will be 4 characters that end in "==".
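As a quick way to reproduce that hex view, here is a minimal sketch that decodes the string from the question leniently and groups the bytes into 3-octet blocks. The variable names are only illustrative.

```ruby
# Lenient decode of the string from the question.
decoded = "a8dnsjg8aiw8jq==".unpack('m').first

# Hex-dump the bytes, then group them into 3-octet Base64 input blocks.
hex_octets = decoded.unpack('H*').first.upcase.scan(/../)
blocks = hex_octets.each_slice(3).map { |group| group.join(' ') }

puts blocks.join(' | ')  # prints 6B C7 67 | B2 38 3C | 6A 2C 3C | 8E
```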

Let's see what the canonical encoding of that last block would be. In binary representation 8E is 10001110. The RFC tells us to pad the missing bits with zeroes until we reach the required 24 bits:

100011 100000 000000 000000

I made groups of 6 bits, because that's what we need to get the corresponding characters from the Base64 alphabet. The first group (100011) translates to 35 decimal and thus is a j in the Base64 alphabet. The second (100000) is 32 decimal and hence a 'g'. The two remaining characters are to be padded as "==" according to the rules. So the canonical encoding is

jg==
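The same derivation can be checked in Ruby. This is only a sketch: it encodes the single octet 8E with pack('m0') and also rebuilds the two 6-bit groups by hand. The alphabet array and variable names are introduced here for illustration.

```ruby
# A one-byte binary string holding 0x8E.
octet = [0x8E].pack('C')

# 'm0' encodes as Base64 without a trailing line feed.
canonical = [octet].pack('m0')
puts canonical  # prints jg==

# Rebuild the first two 6-bit groups by hand: 8 data bits plus 4 zero pad bits.
alphabet = ('A'..'Z').to_a + ('a'..'z').to_a + ('0'..'9').to_a + ['+', '/']
bits = octet.unpack('B*').first + '0000'
values = bits.scan(/.{6}/).map { |group| group.to_i(2) }  # [35, 32]
symbols = values.map { |v| alphabet[v] }.join              # "jg"
```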

If you look at jq== now, in binary this will be

100011 101010 000000 000000

So the difference is in the second group. But we already know that only the first 8 bits are of interest to us (the "==" tells us that we will retrieve only one decoded octet from these four characters), so we actually only care about the first two bits of the second group: the 6 bits of group 1 and the first 2 bits of group 2 form our decoded octet. 100011 10 together again form our initial 8E byte value. The remaining 16 bits are irrelevant to us and can thus be discarded.

This also shows why the notion of "strict" Base64 decoding makes sense: non-strict decoding discards any garbage in the padding bits at the end, whereas strict decoding checks that the leftover bits of the final 6-bit group are zero. That's why your non-canonical encoding is rejected by the strict decoding rules.
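To see the two behaviours side by side, a small sketch along these lines should do. It uses only String#unpack as in the question; the variable names are made up for the example.

```ruby
s = "a8dnsjg8aiw8jq=="          # non-canonical: pad bits of 'q' are not zero
canonical = "a8dnsjg8aiw8jg=="  # same data with zeroed pad bits

# Lenient decoding ignores the pad bits.
lenient = s.unpack('m').first

# Strict decoding raises ArgumentError for the non-canonical form.
strict_error = begin
  s.unpack('m0')
  nil
rescue ArgumentError => e
  e.message
end

# The canonical form passes strict decoding and yields the same bytes.
strict = canonical.unpack('m0').first

puts lenient.bytesize       # prints 10
puts strict_error.inspect
puts strict == lenient      # prints true
```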

高跟鞋的旋律 2024-12-06 14:29:04

The RFC you've linked says plainly that the final quad of form xx== corresponds to one octet of the input sequence. You cannot make 16 bits of information (two arbitrary octets) out of 12, so rounding up is invalid here.

Your string is rejected in strict mode because jq== cannot appear as the result of a correct Base64 encoding process. An input sequence whose length is not a multiple of 3 is zero-padded, and your string has non-zero bits where they cannot appear:

   j      q      =      =
|100011|101010|000000|000000|
|10001110|10100000|00000000|
          ^^^
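The pad bits in the diagram can also be pulled out programmatically. The helper below is hypothetical (pad_bits is not part of Ruby or the thread) and assumes a well-formed string over the standard alphabet with zero, one, or two trailing "=" characters.

```ruby
# Hypothetical helper: returns the integer value of the pad bits carried by
# the last real symbol of a padded Base64 string (0 means canonical padding).
def pad_bits(encoded)
  alphabet = ('A'..'Z').to_a + ('a'..'z').to_a + ('0'..'9').to_a + ['+', '/']
  last_value = alphabet.index(encoded.delete('=')[-1])
  case encoded.count('=')
  when 2 then last_value & 0b1111  # xx==: low 4 bits of the 2nd symbol are padding
  when 1 then last_value & 0b11    # xxx=: low 2 bits of the 3rd symbol are padding
  else 0                           # no padding, no pad bits
  end
end

puts pad_bits("a8dnsjg8aiw8jq==").to_s(2)  # prints 1010
puts pad_bits("a8dnsjg8aiw8jg==")          # prints 0
```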

烂人 2024-12-06 14:29:04

From section 3.5 Canonical Encoding of RFC4648:

For example, if the input is only one octet for a base 64 encoding,
then all six bits of the first symbol are used, but only the first
two bits of the next symbol are used. These pad bits MUST be set to
zero by conforming encoders…

and

In some environments, the alteration is critical and therefore
decoders MAY chose to reject an encoding if the pad bits have not
been set to zero.

Your last four bytes (jq==) decode to these binary values:

100011 101010
------ --****

The underlined bits are used to form the last decoded byte (hex 8E). The remaining bits (with asterisks under them) are supposed to be zero (which would be encoded as jg==, not jq==).

The m unpacking is forgiving about padding bits that should be zero but are not. The m0 unpacking is not so forgiving, as it is allowed to be (see "MAY" in the quoted bit from the RFC). Packing the unpacked result is not symmetric because your encoded value is non-canonical, while the pack method produces a canonical encoding (pad bits equal to zero).
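One consequence is that a lenient decode followed by a re-encode can serve as a canonicalizer. The sketch below assumes input without embedded line feeds or stray characters; canonical_base64? is a hypothetical helper name, not a Ruby API.

```ruby
s = "a8dnsjg8aiw8jq=="

# Decode leniently, then re-encode: pack always zeroes the pad bits.
canonical = s.unpack('m').pack('m0')
puts canonical  # prints a8dnsjg8aiw8jg==

# Hypothetical helper: true when the string survives a decode/encode round trip.
def canonical_base64?(encoded)
  encoded.unpack('m').pack('m0') == encoded
end

puts canonical_base64?(s)          # prints false
puts canonical_base64?(canonical)  # prints true
```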

还在原地等你 2024-12-06 14:29:04

Thanks for the good explanations on b64. I've upvoted you all and accepted @emboss's response.

However, this is not the answer I was looking for. To better state the question, it would be,

How to pad a string of b64 characters so that it can be decoded to
zero-padded 8-bit bytes by unpack('m0')?

From your explanations I now see that this will work for our purposes:

ruby-1.9.2-p180 :858 >   s = "a8dnsjg8aiw8jq".ljust(16,'A')
 => "a8dnsjg8aiw8jqAA" 
ruby-1.9.2-p180 :859 > s.unpack('m0')
 => ["k\xC7g\xB28<j,<\x8E\xA0\x00"] 
ruby-1.9.2-p180 :861 > s.unpack('m0').pack('m0') == s
 => true

The only problem then being that the decoded string length is not preserved, but we can work around that.
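The thread does not say how the length issue was worked around, so the following is only one possible sketch. It assumes the goal is to keep ceil(bits / 8) bytes, i.e. the data bits plus a zero-filled final partial byte, and drop the extra bytes produced by the 'A' filler. decode_bits is a hypothetical helper name.

```ruby
# Hypothetical helper: strictly decodes a run of unpadded Base64 characters,
# zero-filling the final partial byte and trimming bytes added by the filler.
def decode_bits(chars)
  padded_size = (chars.size + 3) / 4 * 4          # next multiple of 4
  decoded = chars.ljust(padded_size, 'A').unpack('m0').first
  byte_count = (chars.size * 6 + 7) / 8           # ceil(bits / 8)
  decoded.byteslice(0, byte_count)
end

result = decode_bits("a8dnsjg8aiw8jq")
puts result.bytesize                 # prints 11
puts result.unpack('H*').first       # prints 6bc767b2383c6a2c3c8ea0
```

Because 'A' is the symbol for six zero bits, the filler never introduces non-zero pad bits, which is why the strict 'm0' decode accepts the padded string.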
