Unicode: A Guide for Dummies
Could anyone give me concise definitions of
- Unicode
- UTF7
- UTF8
- UTF16
- UTF32
- Codepages
- How they differ from ASCII/ANSI/Windows-1252
I'm not after Wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
This is a good start: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
If you want a really brief introduction:
Unicode in 5 Minutes
Or if you are after one-liners:
Why do you care? Because without knowing the character set and encoding in use, you don't really know what characters a given byte stream represents. For example, the byte 0xDE could encode
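As a small illustration of that point, the sketch below decodes the single byte 0xDE under a few encodings. The three legacy encodings are my own picks for the demonstration, and `describe` is a hypothetical helper name; the codec names are the ones Python's standard library uses.

```python
import unicodedata

def describe(raw: bytes, codec: str) -> str:
    """Decode `raw` with `codec` and describe the resulting single character."""
    char = raw.decode(codec)
    return f"U+{ord(char):04X} {unicodedata.name(char)}"

raw = bytes([0xDE])

# Same byte, different character, depending on the encoding assumed.
for codec in ("latin-1", "cp1251", "iso8859-7"):
    print(f"{codec:10} -> {describe(raw, codec)}")

# As UTF-8, a lone 0xDE is not a complete character at all:
# it is the lead byte of a two-byte sequence with nothing after it.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(f"{'utf-8':10} -> invalid ({exc.reason})")
```

The byte itself never changes; only the interpretation does.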
As well as the oft-referenced Joel one, I have my own article which looks at it from a .NET-centric viewpoint, just for variety...
Yeah, I have some insight. It might be wrong, but it has helped me understand it.
Let's just take some text. It's stored in the computer's RAM as a series of bytes; the codepage is simply the mapping table between those bytes and the characters you and I read. So something like Notepad comes along with its own codepage and translates the bytes to your screen, and if that codepage is the wrong one you see a bunch of garbage, upside-down question marks, etc. This does not mean your data is garbled, only that the application reading the bytes is not using the correct codepage. Some applications are smarter than others at detecting the correct codepage, and some streams of bytes begin with a BOM (Byte Order Mark), which can indicate which Unicode encoding to use.
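A rough sketch of how an application might look for a BOM, assuming Python and its standard `codecs` constants. `sniff_bom` is a hypothetical helper name, not an existing API. Note that a BOM can only identify a Unicode encoding (UTF-8, UTF-16, UTF-32); it says nothing about legacy codepages such as Windows-1252.

```python
import codecs

# Check the longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, payload without BOM), or (None, data) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, payload = sniff_bom(sample)
print(encoding, payload.decode(encoding))
```

When no BOM is present the function returns `None`, and the caller still has to know or guess the encoding some other way.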
UTF-7, UTF-8, UTF-16, etc. can all be thought of as different codepages using different formats (strictly speaking, they are different byte encodings of the same Unicode character set).
The same text stored as bytes using different codepages can have a different file size, because each one maps characters to bytes differently.
In that sense they don't really differ from Windows-1252, as that's just another codepage.
For a better, smarter answer, try one of the links.
Here, read this wonderful explanation from Joel himself.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Others have already pointed out good enough references to begin with. I'm not listing a true Dummy's guide, but rather some pointers from the Unicode Consortium page. You'll find some more nitty-gritty reasons for the usage of different encodings at the Unicode Consortium pages.
The Unicode FAQ is a good enough place to answer some (not all) of your queries.
A more succinct answer on why Unicode exists is given in the Newcomer's section of the Unicode website itself:
As far as the technical reasons for usage of UTF-8, UTF-16 or UTF-32 are concerned, the answer lies in the Technical Introduction to Unicode:
A general rule of thumb is to use UTF-8 when the predominant languages supported by your application are spoken west of the Indus river, UTF-16 for the opposite (east of the Indus), and UTF-32 when you want every character to take up the same amount of storage.
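To see where a rule like this comes from, here is a minimal Python sketch that compares how many bytes the same text needs in each encoding. The sample strings and the helper name `encoded_sizes` are my own choices for illustration.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in each encoding (the -le codecs write no BOM)."""
    codecs_to_try = ("utf-8", "utf-16-le", "utf-32-le")
    return {codec: len(text.encode(codec)) for codec in codecs_to_try}

samples = {
    "English": "Hello, world",
    "German": "Grüße",
    "Japanese": "こんにちは",
}

for language, text in samples.items():
    # len(text) counts code points, not bytes.
    print(language, len(text), "code points:", encoded_sizes(text))
```

ASCII-range characters take one byte in UTF-8 but two in UTF-16, while most East Asian characters take three bytes in UTF-8 and two in UTF-16. UTF-32 always uses four bytes per code point, which is the uniform storage mentioned above.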
By the way, UTF-7 is not part of the Unicode Standard; it was designed primarily for use in mail applications.
First of all, there aren't "variations of Unicode". Unicode is a standard, the standard, for assigning code points (integers) to characters. UTF-8 is the most popular way to represent those integers as bytes!
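The split between code points and bytes can be sketched in a few lines of Python. The sample characters and the helper name `utf8_hex` are assumptions made for the illustration.

```python
def utf8_hex(char: str) -> str:
    """Hex dump of the UTF-8 bytes for a single character."""
    return " ".join(f"{byte:02x}" for byte in char.encode("utf-8"))

# ord() gives the Unicode code point (an integer);
# encode() turns that integer into concrete bytes.
for char in "A\u00e4\u20ac\U0001F600":
    print(f"U+{ord(char):04X} -> {utf8_hex(char)}")
```

The code point is fixed by Unicode; the number of bytes (one to four here) is a property of the UTF-8 encoding, not of the character.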
Why should you care as a programmer?
Example: You receive a ByteArray myByteArray from somewhere and you know it represents characters. You then run myByteArray.toString() and you get the string "Hello". Your program works! One day after shipping your code, your German customer calls: "We have a problem, äöü are not displayed correctly!" You start debugging the code, feeling pretty lost without a basic understanding of encodings. However, with an understanding of encodings, you know that the error probably was this: when running myByteArray.toString(), your program assumed the string was encoded with the default system encoding. But maybe it wasn't! Maybe it was UTF-8 and your system is LATIN-SOMETHING, so you should have run myByteArray.toString("UTF8") instead!
Resources:
I would NOT recommend Joel's article as suggested by others. It's a long article with a lot of irrelevant information. I read it a couple of years back and the essence of it didn't stick to my brain since there are so many unimportant details.
As already mentioned, http://wiki.secondlife.com/wiki/Unicode_In_5_Minutes is a great place to go to grasp the essence of Unicode.
If you want to actually understand variable-length encodings like UTF-8, I'd recommend https://www.tsmean.com/articles/encoding/unicode-and-utf-8-tutorial-for-dummies/.
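To make the myByteArray scenario above concrete, here is a minimal Python sketch. The answer's toString() calls are pseudocode, so Python's bytes.decode() stands in for them here, and the variable names are hypothetical.

```python
# Stand-in for the bytes received "from somewhere": text encoded as UTF-8.
my_byte_array = "\u00e4\u00f6\u00fc".encode("utf-8")  # the letters a, o, u with umlauts

# Wrong assumption (Latin-1): no error is raised, the result is just mojibake.
wrong = my_byte_array.decode("latin-1")

# Right assumption (UTF-8): the original three characters come back.
right = my_byte_array.decode("utf-8")

print(len(wrong), ascii(wrong))
print(len(right), ascii(right))

# Pure ASCII text such as "Hello" decodes identically either way,
# which is why the bug stays hidden until non-ASCII text arrives.
print(b"Hello".decode("latin-1") == b"Hello".decode("utf-8"))

# The bytes were never damaged: re-encoding the mojibake restores them.
print(wrong.encode("latin-1") == my_byte_array)
```

Passing the encoding explicitly, rather than relying on a system default, is what the answer's myByteArray.toString("UTF8") fix amounts to.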