Why not use base128?
Why is base64, rather than base128, used to transmit binary data on the web? The ASCII character set has 128 characters, which in theory could represent base128, yet in most cases only base64 is used.
Comments (8)
The problem is that at least 32 characters of the ASCII character set are control characters, which may be interpreted by the receiving terminal. For example, the BEL (bell) character makes the receiving terminal chime, and the SOH (Start of Heading) and EOT (End of Transmission) characters do what their names imply. And don't forget CR and LF, which may have special meanings in how data structures are serialized/flattened into a stream.
Adobe adopted the Base85 (Ascii85) encoding to use more characters of the ASCII character set, but AFAIK it's protected by patents.
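To make the numbers in this answer concrete, here is a minimal sketch that counts the control and printable code points in 7-bit ASCII and then encodes a few control bytes with Ascii85. It assumes Python's standard `base64` module as the Ascii85 implementation; the variable names are made up for illustration.

```python
import base64

# Split the 128 ASCII code points into control and printable characters.
control = [c for c in range(128) if c < 0x20 or c == 0x7F]
printable = [c for c in range(128) if 0x20 <= c <= 0x7E]

print(len(control))    # 33: 0x00-0x1F plus DEL (0x7F)
print(len(printable))  # 95: space through '~'

# Raw BEL, EOT, CR and LF bytes, which a terminal might act on.
sample = b"\x07\x04\x0d\x0a"

# Ascii85 turns each 4-byte group into 5 printable characters.
encoded = base64.a85encode(sample)
print(encoded, len(encoded))

# Decoding restores the original control bytes.
restored = base64.a85decode(encoded)
print(restored == sample)
```

With only 95 printable characters available, a 128-symbol alphabet cannot be built from 7-bit ASCII without using control characters.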
Because some of those 128 characters are unprintable (mainly those below code point 0x20). Therefore, they can't reliably be transmitted as a string over the wire. And if you go above code point 127, you can run into encoding issues, because different systems use different encodings.
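The encoding problem above code point 127 can be sketched with a single byte. The code pages chosen here (Latin-1, CP437, UTF-8) are only examples picked for illustration, not ones named in the answer.

```python
raw = bytes([0xE9])  # one byte above the 7-bit ASCII range

# The same byte decodes to different characters under different code pages.
as_latin1 = raw.decode("latin-1")
as_cp437 = raw.decode("cp437")
print(repr(as_latin1), repr(as_cp437))

# On its own, the byte is not even valid UTF-8.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)

# A plain ASCII byte, by contrast, decodes identically everywhere.
ascii_byte = bytes([0x41])
same_everywhere = (
    ascii_byte.decode("latin-1")
    == ascii_byte.decode("cp437")
    == ascii_byte.decode("utf-8")
)
print(same_everywhere)
```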
As already stated in the other answers, the key point is to reduce the character set to the printable characters.
A more efficient encoding scheme is basE91, because it uses a larger character set and still avoids the control/whitespace characters in the low ASCII range. The basE91 webpage contains a nice comparison of binary vs. base64 vs. basE91 encoding efficiency.
I once cleaned up the Java implementation. If people are interested, I could push it to GitHub.
Update: It's now on GitHub.
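The effect of a larger alphabet on output size can be sketched with Python's standard library. basE91 is not part of the standard library, so this sketch substitutes Ascii85 as a stand-in for "a bigger printable alphabet"; it is not a basE91 measurement, and the payload is an arbitrary made-up test pattern.

```python
import base64

# 1200 bytes: divisible by both 3 and 4, so neither encoding needs padding.
# The pattern contains no run of four zero bytes, which Ascii85 would shorten.
payload = bytes(i % 251 for i in range(1200))

b64 = base64.b64encode(payload)  # 3 bytes -> 4 characters
a85 = base64.a85encode(payload)  # 4 bytes -> 5 characters

b64_ratio = len(b64) / len(payload)
a85_ratio = len(a85) / len(payload)

print(len(b64), round(b64_ratio, 3))
print(len(a85), round(a85_ratio, 3))
```

The more symbols an alphabet has, the more bits each output character carries, which is why encodings with 85 or 91 symbols produce less overhead than base64.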
That the first 32 characters are control characters has absolutely no relevance, because you don't have to use them to get 128 characters. We have 256 characters to choose from, and only the first 32 are control characters. That leaves 224 characters (192 if the 32 C1 control codes at 0x80-0x9F are also excluded), so 128 is entirely possible without using control characters.
Here is the reason: it has to be something that looks the same and that you can copy and paste, no matter where. Therefore it has to consist of characters that are displayed the same way on any forum, chat, email and so on. That means we can't use characters that forum/chat/email clients typically use for formatting or disregard. The characters also have to be the same regardless of font, language and regional settings.
That is the reason!
Base64 is common because it solves a variety of issues (it works nearly everywhere you can think of):
You don't need to worry whether the transport is 8-bit clean or not.
All the characters in the encoding are printable. You can see them. You can copy and paste them. You can use them in URLs (particular variants). And so on.
Fixed encoding size. You know that m bytes always encode to n bytes.
Everyone has heard of it - it's widely supported and there are lots of libraries, so it's easy to interoperate with.
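The fixed-size and URL-variant points in the list above can be checked with a short sketch. The helper name `b64_encoded_len` is invented here; the formula is the usual one for padded base64, where every group of up to 3 input bytes becomes 4 output characters.

```python
import base64

def b64_encoded_len(m: int) -> int:
    """Length of padded base64 output for m input bytes (hypothetical helper)."""
    return 4 * ((m + 2) // 3)

# The formula matches the library output for every input length tried.
for m in range(50):
    assert len(base64.b64encode(bytes(m))) == b64_encoded_len(m)

print(b64_encoded_len(10))  # 16

# The URL-safe variant swaps '+' and '/' for '-' and '_'.
data = b"\xfb\xff"
standard = base64.b64encode(data)
url_safe = base64.urlsafe_b64encode(data)
print(standard, url_safe)  # b'+/8=' b'-_8='
```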
Base128 doesn't have all those advantages.
At first glance it also looks safe for transports that are not 8-bit clean, since 128 symbols fit in 7 bits - but recall that base64 actually uses 65 symbols (64 plus the '=' padding character). Without such an out-of-band character you can't have the benefits of a fixed encoding size, and if base128 uses one, its 129 symbols no longer fit in 7 bits.
It's not all negative though.
base128 is easier to encode/decode than base64 - you just use shifts and masks. This can be important for embedded implementations.
base128 makes slightly more efficient use of the transport than base64 by using more of the available bits (7 of every 8 instead of 6).
People do use base128 - I'm using it for something now. It's just not as common.
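The "shifts and masks" remark can be illustrated with a minimal sketch of one possible base128 scheme. There is no standard base128, so this is an assumption rather than the answerer's actual code: it packs input bits into 7-bit symbols with values 0-127, uses no padding symbol, and relies on the receiver knowing where the symbol stream ends. The function names are invented.

```python
import base64

def b128_encode(data: bytes) -> bytes:
    """Pack 8-bit bytes into 7-bit symbols (hypothetical scheme, no padding)."""
    out = bytearray()
    acc = 0    # bit accumulator
    nbits = 0  # number of pending bits in acc
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1  # keep only the pending bits
    if nbits:
        # Left-align the leftover bits in a final symbol.
        out.append((acc << (7 - nbits)) & 0x7F)
    return bytes(out)

def b128_decode(symbols: bytes) -> bytes:
    """Inverse of b128_encode; trailing filler bits are discarded."""
    out = bytearray()
    acc = 0
    nbits = 0
    for sym in symbols:
        acc = (acc << 7) | (sym & 0x7F)
        nbits += 7
        if nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    return bytes(out)

payload = bytes(range(21))
encoded = b128_encode(payload)
print(len(encoded), len(base64.b64encode(payload)))  # 24 28
```

Note that the symbols produced here include values below 0x20, i.e. control characters, which is exactly the objection raised in the other answers.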
Not sure, but I think the lower values (representing control codes and the like) are not reliably transferred as text/characters inside HTTP requests/responses, and the values above 127 might be locale- or code-page-specific, so there are not 128 different characters that can be expected to work across all browsers/platforms.
esaji is right. Base64 is used to encode binary data for transmission using a protocol that expects only text. It's right there in the Wiki entry.
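As a small sketch of that use case, the snippet below carries raw bytes through JSON, a text-only format. JSON is only an example chosen here, and the field name `payload` is made up.

```python
import base64
import json

# Binary data containing NUL, BEL and a non-ASCII byte.
blob = b"\x00\x07\xff"

# Encode to plain ASCII text so it can sit inside a text-only message.
text = base64.b64encode(blob).decode("ascii")
message = json.dumps({"payload": text})
print(message)  # {"payload": "AAf/"}

# The receiver reverses the steps to get the original bytes back.
restored = base64.b64decode(json.loads(message)["payload"])
print(restored == blob)
```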
Check out the base128 PHP class. It encodes and decodes using the ISO 8859-1 charset.
GoogleCode PHP-Class Base128