Why does base64 encoding require padding if the input length is not divisible by 3?

Posted 2024-09-30 00:07:46

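What is the purpose of padding in base64 encoding? The following is an extract from Wikipedia:

"One additional pad character is allocated which may be used to force the encoded output into an integer multiple of 4 characters (or equivalently when the unencoded binary text is not a multiple of 3 bytes); these padding characters must then be discarded when decoding but still allow the calculation of the effective length of the unencoded text, when its input binary length would not be a multiple of 3 bytes (the last non-pad character is normally encoded so that the last 6-bit block it represents will be zero-padded on its least significant bits, at most two pad characters may occur at the end of the encoded stream)."

I wrote a program that can base64-encode any string and decode any base64-encoded string. What problem does padding solve?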

谷夏 2024-10-07 00:07:46

Your conclusion that padding is unnecessary is right. It's always possible to determine the length of the input unambiguously from the length of the encoded sequence.

However, padding is useful in situations where base64 encoded strings are concatenated in such a way that the lengths of the individual sequences are lost, as might happen, for example, in a very simple network protocol.

If unpadded strings are concatenated, it's impossible to recover the original data because information about the number of odd bytes at the end of each individual sequence is lost. However, if padded sequences are used, there's no ambiguity, and the sequence as a whole can be decoded correctly.
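To make the claim about recovering the input length concrete, here is a minimal sketch in Python. The helper name `decoded_length` is made up for illustration; it only assumes the standard 6-bits-per-character layout of base64.

```python
import base64


def decoded_length(unpadded_len: int) -> int:
    """Bytes represented by `unpadded_len` base64 characters (no '=')."""
    if unpadded_len % 4 == 1:
        # A single leftover character carries only 6 bits, less than one byte,
        # so no encoder can produce this length.
        raise ValueError("not a valid unpadded base64 length")
    # Each character carries 6 bits; leftover zero bits are discarded.
    return unpadded_len * 6 // 8


# Check the formula against the standard library encoder.
for data in (b"", b"f", b"fo", b"foo", b"foob", b"fooba", b"foobar"):
    unpadded = base64.b64encode(data).rstrip(b"=")
    assert decoded_length(len(unpadded)) == len(data)
```

This only works when the length of each individual encoded sequence is known, which is exactly what concatenation destroys.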

Edit: An Illustration

Suppose we have a program that base64-encodes words, concatenates them and sends them over a network. It encodes "I", "AM" and "TJM", sandwiches the results together without padding and transmits them.

I encodes to SQ (SQ== with padding)
AM encodes to QU0 (QU0= with padding)
TJM encodes to VEpN (VEpN with padding)

So the transmitted data is SQQU0VEpN. The receiver base64-decodes the first eight characters as I\x04\x14\xd1Q) and is left with a stray N that cannot be decoded on its own, instead of recovering the intended IAMTJM. The result is nonsense because the sender has destroyed information about where each word ends in the encoded sequence. If the sender had sent SQ==QU0=VEpN instead, the receiver could have decoded this as three separate base64 sequences which would concatenate to give IAMTJM.
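The illustration can be reproduced with a short Python sketch. The helper `decode_quanta` is hypothetical and not part of any library; it decodes each 4-character group separately, because how a library decoder treats data that follows a padding character varies between implementations.

```python
import base64

words = [b"I", b"AM", b"TJM"]
padded = b"".join(base64.b64encode(w) for w in words)
unpadded = b"".join(base64.b64encode(w).rstrip(b"=") for w in words)


def decode_quanta(stream: bytes) -> bytes:
    """Decode concatenated padded base64 strings, 4 characters at a time."""
    if len(stream) % 4 != 0:
        raise ValueError("stream is not a whole number of 4-character groups")
    return b"".join(
        base64.b64decode(stream[i:i + 4]) for i in range(0, len(stream), 4)
    )


print(padded)                          # b'SQ==QU0=VEpN'
print(decode_quanta(padded))           # b'IAMTJM'
print(unpadded)                        # b'SQQU0VEpN'
# Nine characters is not a decodable length, so decode the first eight only.
print(base64.b64decode(unpadded[:8]))  # b'I\x04\x14\xd1Q)'
```

With padding, every group of four characters is a valid base64 string by itself, which is why the receiver needs no length information.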

Why Bother with Padding?

Why not just design the protocol to prefix each word with an integer length? Then the receiver could decode the stream correctly and there would be no need for padding.

That's a great idea, as long as we know the length of the data we're encoding before we start encoding it. But what if, instead of words, we were encoding chunks of video from a live camera? We might not know the length of each chunk in advance.

If the protocol used padding, there would be no need to transmit a length at all. The data could be encoded as it came in from the camera, each chunk terminated with padding, and the receiver would be able to decode the stream correctly.

Obviously that's a very contrived example, but perhaps it illustrates why padding might conceivably be helpful in some situations.

画中仙 2024-10-07 00:07:46

On a related note, here's an arbitrary base converter I created for you. Enjoy!
https://convert.zamicol.com

What are Padding Characters?

Padding characters help satisfy length requirements and carry no other meaning.

Decimal Example of Padding:
Given the arbitrary requirement all strings be 8 characters in length, the number 640 can meet this requirement using preceding 0's as padding characters as they carry no meaning, "00000640".

Binary Encoding

The Byte Paradigm: For encoding, the byte is the de facto standard unit of measurement and any scheme must relate back to bytes.

Base256 fits exactly into the byte paradigm. One byte is equal to one character in base256.

Base16, hexadecimal or hex, uses 4 bits for each character. One byte is represented by two base16 characters.

Base64 does not fit evenly into the byte paradigm (nor does base32), unlike base256 and base16. All base64 characters can be represented in 6 bits, 2 bits short of a full byte.

We can represent base64 encoding versus the byte paradigm as a fraction: 6 bits per character over 8 bits per byte. Reduced, this fraction is 3/4: 3 bytes for every 4 characters.

This ratio, 3 bytes for every 4 base64 characters, is the rule we want to follow when encoding base64. Base64 encoding only lines up evenly with bytes in bundles of 3 bytes, unlike base16 and base256 where every byte can stand on its own.
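As a rough sketch of what the 3-to-4 ratio means for output size, the following Python computes the padded length and the number of padding characters for a given input size. The function names are illustrative, not from any library.

```python
import base64


def padded_length(n_bytes: int) -> int:
    """Length of the padded base64 output: 4 characters per 3-byte bundle."""
    return 4 * ((n_bytes + 2) // 3)  # integer form of 4 * ceil(n / 3)


def pad_count(n_bytes: int) -> int:
    """Number of '=' characters needed to complete the last bundle."""
    return (3 - n_bytes % 3) % 3


# Compare against the standard library encoder for small inputs.
for n in range(10):
    encoded = base64.b64encode(b"x" * n)
    assert len(encoded) == padded_length(n)
    assert encoded.count(b"=") == pad_count(n)
```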

So why is padding encouraged even though encoding could work just fine without the padding characters?

If the length of a stream is unknown, or if it could be helpful to know exactly when a data stream ends, use padding. The padding characters communicate explicitly that those extra spots should be empty and rule out any ambiguity. Even if the length is unknown, with padding you'll know where your data stream ends.

As a counterexample, some standards like JOSE don't allow padding characters. In this case, if there is something missing, a cryptographic signature won't work or other non-base64 characters will be missing (like the "."). Although assumptions about length aren't made, padding isn't needed because if there is something wrong it simply won't work.

And this is exactly what the base64 RFC says,

In some circumstances, the use of padding ("=") in base-encoded data
is not required or used. In the general case, when assumptions about
the size of transported data cannot be made, padding is required to
yield correct decoded data.
[...]
The padding step in base 64 [...] if improperly
implemented, lead to non-significant alterations of the encoded data.
For example, if the input is only one octet for a base 64 encoding,
then all six bits of the first symbol are used, but only the first
two bits of the next symbol are used. These pad bits MUST be set to
zero by conforming encoders, which is described in the descriptions
on padding below. If this property do not hold, there is no
canonical representation of base-encoded data, and multiple base-
encoded strings can be decoded to the same binary data. If this
property (and others discussed in this document) holds, a canonical
encoding is guaranteed.
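The point about pad bits can be checked by hand. The sketch below is my own illustration, not code from the RFC: it splits the first two symbols of a one-octet encoding into the data byte and the four pad bits. `split_first_octet` is a hypothetical helper name.

```python
import string

# The standard base64 alphabet from RFC 4648.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def split_first_octet(two_symbols: str):
    """Return (data byte, pad bits) for a one-octet encoding like 'Zg'."""
    bits = (ALPHABET.index(two_symbols[0]) << 6) | ALPHABET.index(two_symbols[1])
    # 12 bits in total: the top 8 are data, the low 4 are pad bits.
    return bits >> 4, bits & 0xF


print(split_first_octet("Zg"))  # (102, 0): 'f' with zero pad bits, canonical
print(split_first_octet("Zh"))  # (102, 1): also 'f', but a pad bit is set
```

Both "Zg==" and "Zh==" carry the same octet, so a decoder that ignores the pad bits accepts two different strings for the same data. Requiring zero pad bits is what makes the encoding canonical.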

Padding allows us to decode base64 encoding with the promise of no lost bits. Without padding, there is no longer the explicit acknowledgement of measuring in three-byte bundles. Without padding, you may not be able to guarantee exact reproduction of the original encoding without additional information, usually from somewhere else in your stack, like TCP, checksums, or other methods.

An alternative to bucket conversion schemes like base64 is radix conversion, which has no arbitrary bucket sizes and, for left-to-right readers, is left-padded. The "iterative divide by radix" conversion method is typically employed for radix conversions.
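For comparison, here is a minimal sketch of the "iterative divide by radix" method in Python. The function name `to_radix` is made up for illustration, and it works on a non-negative integer rather than on a byte string.

```python
def to_radix(value: int, radix: int) -> list:
    """Digits of `value` in base `radix`, most significant digit first."""
    if value < 0 or radix < 2:
        raise ValueError("value must be >= 0 and radix must be >= 2")
    digits = []
    while True:
        # Each division peels off the least significant digit.
        value, remainder = divmod(value, radix)
        digits.append(remainder)
        if value == 0:
            break
    return digits[::-1]


digits = to_radix(640, 10)
print(digits)  # [6, 4, 0]
# Left padding to a fixed width of 8 digits, as in the decimal example.
print("".join(str(d) for d in [0] * (8 - len(digits)) + digits))  # 00000640
print(to_radix(102, 64))  # [1, 38]
```

Note that the byte 102 ("f") becomes the digits 1 and 38 under radix conversion, whereas the bucket scheme of base64 gives the symbol values 25 and 32 ("Zg"). The two approaches are not interchangeable.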

Examples

Here is the example from RFC 4648 (https://www.rfc-editor.org/rfc/rfc4648#section-8)

Each character inside the "BASE64" function uses one byte (base256). We then translate that to base64.

BASE64("")       = ""           (No bytes used. 0 % 3 = 0)
BASE64("f")      = "Zg=="       (One byte used. 1 % 3 = 1)
BASE64("fo")     = "Zm8="       (Two bytes.     2 % 3 = 2)
BASE64("foo")    = "Zm9v"       (Three bytes.   3 % 3 = 0)
BASE64("foob")   = "Zm9vYg=="   (Four bytes.    4 % 3 = 1)
BASE64("fooba")  = "Zm9vYmE="   (Five bytes.    5 % 3 = 2)
BASE64("foobar") = "Zm9vYmFy"   (Six bytes.     6 % 3 = 0)
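The table above can be reproduced with Python's standard `base64` module. This is just a quick check of the RFC 4648 test vectors, not part of the RFC itself.

```python
import base64

# Test vectors from RFC 4648, section 10.
vectors = {
    b"": b"",
    b"f": b"Zg==",
    b"fo": b"Zm8=",
    b"foo": b"Zm9v",
    b"foob": b"Zm9vYg==",
    b"fooba": b"Zm9vYmE=",
    b"foobar": b"Zm9vYmFy",
}

for raw, expected in vectors.items():
    encoded = base64.b64encode(raw)
    assert encoded == expected
    # The remainder of len(raw) / 3 determines how many '=' are appended.
    print(raw, encoded, len(raw) % 3)
```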

Here's an encoder that you can play around with: http://www.motobit.com/util/base64-decoder-encoder.asp

￠蛋碎的人ぎ生 2024-10-07 00:07:46

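In modern times, it doesn't have much benefit. So let's treat this as a question of what the original historical purpose may have been.

Base64 encoding makes its first appearance in RFC 1421, dated 1993. That RFC is actually focused on encrypting email, and base64 is described in one small section, 4.3.2.4.

The RFC does not explain the purpose of the padding. The closest we get to a statement of the original purpose is this sentence:

A full encoding quantum is always completed at the end of a message.

It suggests neither concatenation (the top answer here) nor ease of implementation as an explicit purpose for the padding. However, considering the description as a whole, it is not unreasonable to assume that padding was meant to help the decoder read the input in 32-bit units ("quanta"). That is of no benefit today, but in 1993 unsafe C code would very likely have taken advantage of this property.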

尾戒 2024-10-07 00:07:46

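With padding, a base64 string always has a length that is a multiple of 4 (if it doesn't, the string has certainly been corrupted), so code can easily process the string in a loop that handles 4 characters at a time (always converting 4 input characters into 3 or fewer output bytes). Padding therefore makes sanity checking easy (length % 4 != 0 ==> error, since that is impossible with padding), and it makes processing simpler and more efficient.

I know what people will think: even without padding, I can process all the 4-byte chunks in a loop and then just add special handling for the last 1 to 3 bytes, if there are any. It's only a few extra lines of code, and the speed difference will be too small to even measure. That is probably true, but you are thinking in terms of C (or higher-level languages) and a powerful CPU with plenty of RAM. What if you need to decode base64 in hardware, using a simple DSP that has very limited processing power and no RAM, and you have to write the code in very limited micro-assembly? What if you cannot use code at all and everything has to be done with transistors wired together (a hardwired hardware implementation)? With padding, that is far simpler than without.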
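The fixed 4-characters-in, up-to-3-bytes-out loop described in this answer might look like the following Python sketch. It is an illustration under the assumption of the standard RFC 4648 alphabet, not a hardened decoder, and `decode_padded` is a hypothetical name.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
VALUE = {ch: i for i, ch in enumerate(ALPHABET)}


def decode_padded(text: str) -> bytes:
    """Decode padded base64 in a uniform loop over 4-character groups."""
    if len(text) % 4 != 0:
        # The sanity check that padding makes possible.
        raise ValueError("length is not a multiple of 4")
    out = bytearray()
    for i in range(0, len(text), 4):
        quad = text[i:i + 4]
        pads = len(quad) - len(quad.rstrip("="))
        if pads > 2 or (pads and i + 4 != len(text)):
            raise ValueError("misplaced padding")
        bits = 0
        for ch in quad[:4 - pads]:
            if ch not in VALUE:
                raise ValueError("invalid character: %r" % ch)
            bits = (bits << 6) | VALUE[ch]
        # Treat each '=' as six zero bits, then drop the bytes it stands for.
        bits <<= 6 * pads
        out += bits.to_bytes(3, "big")[:3 - pads]
    return bytes(out)


print(decode_padded("Zm9vYmE="))  # b'fooba'
```

Every iteration does the same work on the same amount of input. The only variation is how many of the three output bytes are kept, which is the property the answer argues is cheap to build in hardware.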
