Why does everyone use latin1?
Someone just said utf8 has a variable-length encoding of 1 to 3 bytes.
So why does everyone still use latin1? If the same text is stored in utf8 it also takes 1 byte per character, but utf8 has the advantage that it can accommodate a larger character set.
- Is there a hidden reason everyone uses latin1?
- What are the disadvantages of using utf8 vs. latin1?
Comments (3)
ISO 8859-1 is the (at least de facto) default character encoding of multiple standards like HTTP (at least for textual contents).
ISO 8859-1 was probably chosen because it is a superset of US-ASCII, the fundamental character set of Internet-based technologies. And since the World Wide Web was invented and developed at CERN in Geneva, Switzerland, that may explain why characters of Western European languages were chosen for the remaining 128 positions.
When the Unicode standard was developed, the ISO 8859-1 character set was used as the base of the Unicode character set (the Universal Character Set), so the first 256 characters are identical to those of ISO 8859-1. This was probably done because of the importance of ISO 8859-1 for the Web, as it already was the standard character encoding for many technologies.
Now, to discuss the advantages of ISO 8859-1 over UTF-8, we need to look at the underlying character sets and the encoding schemes used to encode their characters:
ISO 8859-1 contains 256 characters, and the code point of each character maps directly onto its binary representation. So 123 (decimal) is encoded as 01111011 (binary).
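The direct mapping can be checked with a minimal Python sketch. It assumes Python's built-in `latin-1` codec as a stand-in for ISO 8859-1; the helper name `latin1_bits` is made up for illustration.

```python
def latin1_bits(code_point: int) -> str:
    """Return the bits of the single byte that encodes code_point in latin-1."""
    # chr() gives the character; encoding fails with UnicodeEncodeError above 255.
    encoded = chr(code_point).encode("latin-1")
    assert len(encoded) == 1  # one byte per character, always
    return format(encoded[0], "08b")


print(latin1_bits(123))  # 01111011
print(latin1_bits(233))  # 11101001, the byte for "é"
```

The byte value is simply the code point itself, which is why no lookup or prefix is needed.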
UTF-8 uses a prefixed variable-length encoding scheme in which the leading bits of the first byte indicate the length of the sequence. UTF-8 is used to encode the characters of the Universal Character Set; as restricted by RFC 3629 it covers the code points U+0000 to U+10FFFF, that is 1,114,112 code points. The first 128 characters require 1 byte, the characters in 0x80–0x7FF require 2 bytes, the characters in 0x800–0xFFFF require 3 bytes, and the characters in 0x10000–0x10FFFF require 4 bytes.
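As a rough illustration of those byte counts, the sketch below computes the UTF-8 length of a code point from its range and cross-checks it against Python's own encoder. It assumes the RFC 3629 upper limit of U+10FFFF; `utf8_length` is a hypothetical helper, not a library function.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one code point (RFC 3629 ranges)."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("beyond the Unicode range")


# Cross-check the ranges against Python's built-in UTF-8 encoder.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded_length = len(ch.encode("utf-8"))
    assert utf8_length(ord(ch)) == encoded_length
    print(f"U+{ord(ch):04X} -> {encoded_length} byte(s)")
```

The four sample characters (A, é, €, and an emoji) fall into the 1-, 2-, 3-, and 4-byte ranges respectively.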
So the difference is the range of encodable characters on the one hand and the length of the encoded sequence on the other.
So the choice of the “right” character encoding depends on your needs: if you only need the characters of ISO 8859-1 (or US-ASCII as a subset of it), use ISO 8859-1, as it requires only one byte per character, in contrast to UTF-8, where the characters 128–255 require two bytes. And if you need more or other characters than those in ISO 8859-1, use UTF-8.
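To make the trade-off concrete, here is a small Python sketch comparing the encoded size of a sample string in both encodings. The sample text is an arbitrary choice, and Python's `latin-1` codec is assumed to stand in for ISO 8859-1.

```python
# Sample text: "café crème", written with escapes so the accents are single code points.
text = "caf\u00e9 cr\u00e8me"

latin1_size = len(text.encode("latin-1"))  # one byte per character
utf8_size = len(text.encode("utf-8"))      # the two accented letters take two bytes each

print(len(text), latin1_size, utf8_size)  # 10 10 12

# A character outside ISO 8859-1, such as the euro sign, cannot be stored at all.
try:
    "\u20ac".encode("latin-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(euro_fits)  # False
```

For mostly-ASCII text the size difference is small; it only grows with the share of characters in the 128–255 range.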
1) Performance reasons.
With a constant-length encoding, going to the n-th character of a string is easy. With a variable-length encoding, you have to go through all characters from the beginning of the string to know their lengths.
The only way to achieve this performance in Unicode is through UTF-32 (all characters are 4 bytes), but it takes more memory.
2) All characters with diacritics (accents) in Latin-1 are in the 128-255 range, and are therefore encoded with two bytes each in UTF-8.
3) A lot of programmers don't know how to use Unicode.
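Point 1 can be sketched in Python on raw bytes. This is only an illustration of the idea, not how any particular database or runtime implements it; `utf8_offset_of_char` is a hypothetical helper and the sample string is arbitrary.

```python
def utf8_offset_of_char(data: bytes, n: int) -> int:
    """Byte offset of the n-th character (0-based) in UTF-8 data, found by scanning."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte starts a character.
        if (byte & 0xC0) != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("fewer than n + 1 characters")


text = "na\u00efve caf\u00e9"  # "naïve café", 10 characters
n = 9                          # index of the final "é"

utf8 = text.encode("utf-8")
latin1 = text.encode("latin-1")

# Variable length: the offset is only known after scanning from the start.
print(utf8_offset_of_char(utf8, n))  # 10

# Fixed length: the byte offset is simply n.
print(latin1[n:n + 1].decode("latin-1"))  # é
```

The earlier "ï" occupies two bytes in UTF-8, so every later character is shifted by one byte, which is exactly what the scan has to discover.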
This could be a "reason":
It's really annoying to mix different encodings, so you go with what everyone else uses.
(I'm not saying it's a good reason, but I think it's one some people use.)