Why does base64 encoding require padding if the input length is not divisible by 3?
What is the purpose of padding in base64 encoding? The following is an extract from Wikipedia:
"An additional pad character is allocated which may be used to force the encoded output into an integer multiple of 4 characters (or equivalently when the unencoded binary text is not a multiple of 3 bytes); these padding characters must then be discarded when decoding but still allow the calculation of the effective length of the unencoded text, when its input binary length would not be a multiple of 3 bytes (the last non-pad character is normally encoded so that the last 6-bit block it represents will be zero-padded on its least significant bits, at most two pad characters may occur at the end of the encoded stream)."
I wrote a program which can base64-encode any string and decode any base64-encoded string. What problem does padding solve?
Your conclusion that padding is unnecessary is right. It's always possible to determine the length of the input unambiguously from the length of the encoded sequence.
However, padding is useful in situations where base64 encoded strings are concatenated in such a way that the lengths of the individual sequences are lost, as might happen, for example, in a very simple network protocol.
If unpadded strings are concatenated, it's impossible to recover the original data because information about the number of odd bytes at the end of each individual sequence is lost. However, if padded sequences are used, there's no ambiguity, and the sequence as a whole can be decoded correctly.
Edit: An Illustration
Suppose we have a program that base64-encodes words, concatenates them and sends them over a network. It encodes "I", "AM" and "TJM", sandwiches the results together without padding and transmits them.
"I" encodes to SQ (SQ== with padding)
"AM" encodes to QU0 (QU0= with padding)
"TJM" encodes to VEpN (VEpN with padding)
So the transmitted data is SQQU0VEpN. The receiver base64-decodes this as I\x04\x14\xd1Q) instead of the intended IAMTJM. The result is nonsense because the sender has destroyed information about where each word ends in the encoded sequence. If the sender had sent SQ==QU0=VEpN instead, the receiver could have decoded this as three separate base64 sequences which would concatenate to give IAMTJM.
Why Bother with Padding?
Why not just design the protocol to prefix each word with an integer length? Then the receiver could decode the stream correctly and there would be no need for padding.
That's a great idea, as long as we know the length of the data we're encoding before we start encoding it. But what if, instead of words, we were encoding chunks of video from a live camera? We might not know the length of each chunk in advance.
If the protocol used padding, there would be no need to transmit a length at all. The data could be encoded as it came in from the camera, each chunk terminated with padding, and the receiver would be able to decode the stream correctly.
Obviously that's a very contrived example, but perhaps it illustrates why padding might conceivably be helpful in some situations.
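The "I", "AM", "TJM" illustration can be reproduced with a short sketch. It assumes Python's standard base64 module; decode_quanta is a hypothetical helper written for this example, not part of any library. It relies on the fact that every 4-character quantum of a padded stream is itself a valid base64 string.

```python
import base64

words = [b"I", b"AM", b"TJM"]

# Encode each word on its own, with and without the trailing "=" padding.
padded = [base64.b64encode(w).decode("ascii") for w in words]
unpadded = [p.rstrip("=") for p in padded]

padded_stream = "".join(padded)      # "SQ==QU0=VEpN"
unpadded_stream = "".join(unpadded)  # "SQQU0VEpN"


def decode_quanta(stream: str) -> bytes:
    """Decode concatenated padded base64 strings, one 4-character quantum at a time.

    Hypothetical helper for this illustration only.
    """
    if len(stream) % 4 != 0:
        raise ValueError("stream length is not a multiple of 4")
    return b"".join(
        base64.b64decode(stream[i:i + 4]) for i in range(0, len(stream), 4)
    )


# With padding, the concatenated stream decodes to the intended bytes.
recovered = decode_quanta(padded_stream)  # b"IAMTJM"

# Without padding, the quantum boundaries have shifted. Decoding the first
# two complete quanta of the 9-character stream yields nonsense.
garbled = base64.b64decode(unpadded_stream[:8])  # b"I\x04\x14\xd1Q)"
```

Note that the unpadded stream is 9 characters long, which is not even a valid base64 length, so a strict decoder may reject it outright rather than return garbage.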
On a related note, here's an arbitrary base converter I created for you. Enjoy!
https://convert.zamicol.com
What are Padding Characters?
Padding characters help satisfy length requirements and carry no other meaning.
Decimal Example of Padding:
Given an arbitrary requirement that all strings be 8 characters long, the number 640 can meet it by using leading 0s as padding characters, since they carry no meaning: "00000640".
Binary Encoding
The Byte Paradigm: For encoding, the byte is the de facto standard unit of measurement and any scheme must relate back to bytes.
Base256 fits exactly into the byte paradigm. One byte is equal to one character in base256.
Base16, hexadecimal or hex, uses 4 bits for each character. One byte is represented by two base16 characters.
Base64 does not fit evenly into the byte paradigm (nor does base32), unlike base256 and base16. All base64 characters can be represented in 6 bits, 2 bits short of a full byte.
We can represent base64 encoding versus the byte paradigm as a fraction: 6 bits per character over 8 bits per byte. Reduced, this fraction is 3 bytes over 4 characters.
This ratio, 3 bytes for every 4 base64 characters, is the rule we want to follow when encoding base64. Base64 encoding can only promise even measuring with 3-byte bundles, unlike base16 and base256, where every byte can stand on its own.
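The 3-bytes-per-4-characters ratio translates directly into length arithmetic. The sketch below uses hypothetical helper names (padded_length, unpadded_length, input_length) and shows that the input length can be recovered from the length of an unpadded encoding, as long as that encoding stands alone.

```python
def padded_length(n: int) -> int:
    """Characters produced when encoding n bytes with padding: 4 * ceil(n / 3)."""
    return 4 * ((n + 2) // 3)


def unpadded_length(n: int) -> int:
    """Characters produced when encoding n bytes without padding: ceil(8n / 6)."""
    return (8 * n + 5) // 6


def input_length(m: int) -> int:
    """Bytes represented by a single unpadded encoding of m characters."""
    if m % 4 == 1:
        # One leftover character carries only 6 bits, less than a full byte.
        raise ValueError("not a valid unpadded base64 length")
    return (6 * m) // 8


# 1 byte -> "SQ" (2 chars), 2 bytes -> "QU0" (3 chars), 3 bytes -> "VEpN" (4 chars)
lengths = [(n, unpadded_length(n), padded_length(n)) for n in (1, 2, 3)]
```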
So why is padding encouraged even though encoding could work just fine without the padding characters?
If the length of a stream is unknown or if it could be helpful to know exactly when a data stream ends, use padding. The padding characters communicate explicitly that those extra spots should be empty and rule out any ambiguity. Even if the length is unknown, with padding you'll know where your data stream ends.
As a counterexample, some standards like JOSE don't allow padding characters. In this case, if there is something missing, a cryptographic signature won't work or other non-base64 characters will be missing (like the "."). Although assumptions about length aren't made, padding isn't needed because if there is something wrong it simply won't work.
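When a format omits padding, as JOSE does with its URL-safe base64 variant, a decoder can restore the padding from the length alone before decoding. This is a minimal sketch using Python's standard base64 module; b64url_decode_unpadded is a hypothetical helper name.

```python
import base64


def b64url_decode_unpadded(text: str) -> bytes:
    """Decode a single unpadded URL-safe base64 string by restoring its padding."""
    # -len(text) % 4 is the number of "=" needed to reach a multiple of 4.
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


decoded = [b64url_decode_unpadded(s) for s in ("SQ", "QU0", "VEpN")]
# decoded == [b"I", b"AM", b"TJM"]
```

This only works because each string arrives as a separate, delimited unit, so its length is known.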
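And this matches what the base64 RFC (RFC 4648, section 3.2) says: padding may be omitted in some circumstances, but in the general case, when no assumptions can be made about the size of the transported data, padding is required to yield correct decoded data.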
Padding allows us to decode base64 encoding with the promise of no lost bits. Without padding there is no longer the explicit acknowledgement of measuring in three byte bundles. Without padding you may not be able to guarantee exact reproduction of original encoding without additional information usually from somewhere else in your stack, like TCP, checksums, or other methods.
An alternative to bucket conversion schemes like base64 is radix conversion, which has no arbitrary bucket sizes and, for left-to-right readers, is left-padded. The "iterative divide by radix" conversion method is typically employed for radix conversions.
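The "iterative divide by radix" method can be sketched as follows. The function name to_radix and the alphabets are assumptions chosen for this example. Digits come out least significant first, which is why any padding ends up on the left.

```python
def to_radix(value: int, alphabet: str) -> str:
    """Convert a non-negative integer to a string in base len(alphabet)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    base = len(alphabet)
    if value == 0:
        return alphabet[0]
    digits = []
    while value > 0:
        # Each division peels off the least significant digit.
        value, remainder = divmod(value, base)
        digits.append(alphabet[remainder])
    return "".join(reversed(digits))


decimal = to_radix(640, "0123456789")        # "640"
left_padded = decimal.rjust(8, "0")          # "00000640"
hexadecimal = to_radix(640, "0123456789abcdef")  # "280"
```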
Examples
Here is the example from RFC 4648 (https://www.rfc-editor.org/rfc/rfc4648#section-8)
Each character inside the "BASE64" function uses one byte (base256). We then translate that to base64.
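The test vectors from section 8 of RFC 4648 can be checked against Python's standard base64 module. The list name RFC4648_VECTORS is an assumption made for this sketch.

```python
import base64

# (input, expected base64) pairs from RFC 4648, section 8.
RFC4648_VECTORS = [
    (b"", ""),
    (b"f", "Zg=="),
    (b"fo", "Zm8="),
    (b"foo", "Zm9v"),
    (b"foob", "Zm9vYg=="),
    (b"fooba", "Zm9vYmE="),
    (b"foobar", "Zm9vYmFy"),
]

# Encode every input and compare with the expected output.
results = [
    base64.b64encode(plain).decode("ascii") == expected
    for plain, expected in RFC4648_VECTORS
]
```

Each input grows by one byte per row, so the padding cycles through two, one and zero "=" characters.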
Here's an encoder that you can play around with: http://www.motobit.com/util/base64-decoder-encoder.asp
There is not much benefit to it in the modern day. So let's look at this as a question of what the original historical purpose may have been.
Base64 encoding makes its first appearance in RFC 1421 dated 1993. This RFC is actually focused on encrypting email, and base64 is described in one small section 4.3.2.4.
This RFC does not explain the purpose of the padding. The closest we have to a mention of the original purpose is this sentence: "A full encoding quantum is always completed at the end of a message."
It suggests neither concatenation (top answer here) nor ease of implementation as an explicit purpose for the padding. However, considering the entire description, it is not unreasonable to assume that this may have been intended to help the decoder read the input in 32-bit units ("quanta"). That is of no benefit today; in 1993, however, unsafe C code would very likely have taken advantage of this property.
With padding, a base64 string always has a length that is a multiple of 4 (if it doesn't, the string has certainly been corrupted), so code can process the string in a loop that handles 4 characters at a time, always converting 4 input characters to three or fewer output bytes. Padding therefore makes sanity checking easy (length % 4 != 0 means an error, since that cannot happen with padding), and it makes processing simpler and more efficient.
I know what people will think: even without padding, I can process all 4-character chunks in a loop and then just add special handling for the last 1 to 3 characters, if those exist. It's just a few lines of extra code and the speed difference will be too tiny to even measure. Probably true, but you are thinking in terms of C (or higher-level languages) and a powerful CPU with plenty of RAM. What if you need to decode base64 in hardware, using a simple DSP that has very limited processing power and no RAM storage, and you have to write the code in very limited micro-assembly? What if you cannot use code at all and everything has to be done with just transistors stacked together (a hardwired hardware implementation)? With padding, that's way simpler than without.
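The fixed-size loop described above can be sketched in a few lines. This is an illustrative decoder written for this example (the names ALPHABET, INDEX and decode_padded are assumptions), not how any particular library or hardware implements it. Every iteration consumes exactly 4 characters, and the only variation is how many of the 3 output bytes are kept.

```python
import string

# Standard base64 alphabet: A-Z, a-z, 0-9, "+", "/".
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def decode_padded(text: str) -> bytes:
    """Decode padded base64 strictly 4 characters at a time."""
    if len(text) % 4 != 0:
        # The cheap sanity check that padding makes possible.
        raise ValueError("length is not a multiple of 4")
    out = bytearray()
    for i in range(0, len(text), 4):
        quantum = text[i:i + 4]
        pad = len(quantum) - len(quantum.rstrip("="))
        if pad > 2:
            raise ValueError("at most two pad characters are allowed")
        bits = 0
        for ch in quantum[:4 - pad]:
            # Characters outside the alphabet raise KeyError here.
            bits = (bits << 6) | INDEX[ch]
        bits <<= 6 * pad  # pad positions count as zero bits
        out += bits.to_bytes(3, "big")[:3 - pad]
    return bytes(out)


single = decode_padded("Zm9vYmE=")      # b"fooba"
stream = decode_padded("SQ==QU0=VEpN")  # b"IAMTJM"
```

Because each quantum is handled independently, the same loop also decodes a concatenation of padded strings without any extra logic.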
Padding fills the output length to a multiple of four bytes in a defined way.