Unicode Guide for Dummies

Posted 2024-08-05 12:34:07 · 255 words · 5 views · 0 comments

Could anyone give me concise definitions of

  • Unicode
  • UTF7
  • UTF8
  • UTF16
  • UTF32
  • Codepages
  • How they differ from Ascii/Ansi/Windows 1252

I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.


Comments (7)

兲鉂ぱ嘚淚 2024-08-12 12:34:07

If you want a really brief introduction:
Unicode in 5 Minutes

Or if you are after one-liners:

  • Unicode: a mapping of characters to integers ("code points") in the range 0 through 1,114,111; covers pretty much all written languages in use
  • UTF7: an encoding of code points into a byte stream in which every byte has the high bit clear; in general, do not use it
  • UTF8: an encoding of code points into a byte stream where each character may take one, two, three or four bytes to represent; should be your primary choice of encoding
  • UTF16: an encoding of code points into a word stream (16-bit units) where each character may take one or two words (two or four bytes) to represent
  • UTF32: an encoding of code points into a stream of 32-bit units where each character takes exactly one unit (four bytes); sometimes used for internal representation
  • Codepages: a system in DOS and Windows whereby characters are assigned to integers, and an associated encoding; each covers only a subset of languages. Note that these assignments are generally different from the Unicode assignments
  • ASCII: a very common assignment of characters to integers, and the direct encoding into bytes (all with the high bit clear); the assignment is a subset of Unicode, and the encoding a subset of UTF-8
  • ANSI: a standards body (the American National Standards Institute); on Windows the name is also used loosely for the system's default codepage, such as Windows 1252
  • Windows 1252: A commonly used codepage; it is similar to ISO-8859-1, or Latin-1, but not the same, and the two are often confused
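
To make the one-liners above a bit more concrete, here is a small Python sketch that counts how many bytes a few sample characters take in each encoding. The sample characters and the helper name `encoded_sizes` are my own choices for illustration, not anything from the answer.

```python
# Count how many bytes a few sample characters need in each
# Unicode encoding. The sample characters are arbitrary.
SAMPLES = ["A", "é", "€", "😀"]


def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32."""
    # The "-le" codec variants are used so no byte order mark is prepended.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# ASCII text produces identical bytes under ASCII and UTF-8.
assert "Hello".encode("ascii") == "Hello".encode("utf-8")

# Windows 1252 and ISO-8859-1 disagree on byte 0x80:
# the former has the euro sign there, the latter a control character.
print(repr(b"\x80".decode("cp1252")), repr(b"\x80".decode("latin-1")))
```

The UTF-8 sizes grow from one to four bytes, UTF-16 uses two or four, and UTF-32 always uses four, which matches the descriptions in the list.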

Why do you care? Because without knowing the character set and encoding in use, you don't really know what characters a given byte stream represents. For example, the byte 0xDE could encode

  • Þ (LATIN CAPITAL LETTER THORN)
  • ﬁ (LATIN SMALL LIGATURE FI)
  • ή (GREEK SMALL LETTER ETA WITH TONOS)
  • or 13 other characters, depending on the encoding and character set used.
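
You can try this yourself. The sketch below decodes the single byte 0xDE under a few encodings using Python's built-in codecs; the choice of codecs (`latin-1`, `mac_roman`, `iso8859_7`) is my assumption about which encodings the three examples above come from, and `decode_with` is a hypothetical helper name.

```python
# Decode the single byte 0xDE under several encodings.
RAW = bytes([0xDE])


def decode_with(encoding):
    """Return the character RAW maps to, or None if RAW is invalid there."""
    try:
        return RAW.decode(encoding)
    except UnicodeDecodeError:
        return None


for name in ("latin-1", "mac_roman", "iso8859_7", "utf-8"):
    print(name, repr(decode_with(name)))
```

Note that on its own 0xDE is not valid UTF-8 at all: it is the lead byte of a two-byte sequence, so the decoder rejects it rather than guessing.
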
标点 2024-08-12 12:34:07

As well as the oft-referenced Joel one, I have my own article which looks at it from a .NET-centric viewpoint, just for variety...

静赏你的温柔 2024-08-12 12:34:07

Yeah, I've got some insight. It might be wrong, but it has helped me to understand it.

Let's just take some text. It's stored in the computer's RAM as a series of bytes, and the codepage is simply the mapping table between those bytes and the characters you and I read. So if something like Notepad comes along with a different codepage and translates the bytes to your screen, you see a bunch of garbage, upside-down question marks etc. This does not mean your data is garbled, only that the application reading the bytes is not using the correct codepage. Some applications are smarter than others at detecting the correct codepage to use, and some streams of bytes begin with a BOM (Byte Order Mark), which can signal which Unicode encoding to use.
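
As a rough illustration of the BOM idea, here is a minimal Python sketch that looks for a byte order mark at the start of some bytes. The BOM constants come from the standard `codecs` module; the function names `sniff_bom` and `decode_guess` and the `cp1252` fallback are assumptions of mine, and real applications use more elaborate detection.

```python
import codecs

# Checked in this order because the UTF-32-LE BOM begins with the
# same two bytes as the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, payload) if data starts with a BOM, else (None, data)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


def decode_guess(data, fallback="cp1252"):
    """Hypothetical helper: trust a BOM if present, otherwise use fallback."""
    encoding, payload = sniff_bom(data)
    return payload.decode(encoding or fallback)
```

A BOM only identifies the Unicode encodings. Legacy codepages such as Windows 1252 carry no such marker, which is why the fallback here is just a guess.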

UTF7, 8, 16 etc. can be thought of as just different codepages using different formats.

The same text stored as bytes using different codepages can have a different file size, because the bytes are stored differently.

In that sense they also don't really differ from Windows 1252, as that's just another codepage.

For a better, smarter answer, try one of the links.

深海少女心 2024-08-12 12:34:07

Others have already pointed out good enough references to begin with. I'm not listing a true Dummy's guide, but rather some pointers from the Unicode Consortium pages, where you'll find some more nitty-gritty reasons for the usage of different encodings.

The Unicode FAQ is a good enough place to answer some (not all) of your queries.

A more succinct answer on why Unicode exists is in the Newcomer's section of the Unicode website itself:

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

As far as the technical reasons for usage of UTF-8, UTF-16 or UTF-32 are concerned, the answer lies in the Technical Introduction to Unicode:

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32 is popular where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (or 32-bits) of data for each character.
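
The "pairs of 16-bit code units" mentioned in the quoted text are usually called surrogate pairs. As a sketch of how a code point above U+FFFF is split into such a pair, consider the following Python function; the name `to_utf16_units` is hypothetical and the code is only meant to show the arithmetic.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit code units that encode code_point in UTF-16."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]          # fits in a single code unit
    offset = code_point - 0x10000    # 20 bits remain
    high = 0xD800 + (offset >> 10)   # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> low surrogate
    return [high, low]


# U+1F600 needs a pair; U+0041 ("A") fits in one unit.
print([hex(u) for u in to_utf16_units(0x1F600)])
print([hex(u) for u in to_utf16_units(0x41)])
```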

A general rule of thumb is to use UTF-8 when the predominant languages supported by your application are spoken west of the Indus river, UTF-16 for the opposite (east of the Indus), and UTF-32 when you want every character to occupy the same amount of storage.
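
The reasoning behind that rule of thumb is storage size, which is easy to check. Here is a quick Python sketch comparing UTF-8 and UTF-16 byte counts; the sample strings and the helper name `byte_sizes` are mine, picked only for illustration.

```python
# Compare storage cost of the same text in UTF-8 and UTF-16.
SAMPLES = {
    "English": "Hello, world",
    "German": "Grüße",
    "Hindi": "नमस्ते",
    "Chinese": "你好世界",
}


def byte_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for language, text in SAMPLES.items():
    utf8, utf16 = byte_sizes(text)
    print(f"{language:8} utf-8={utf8:3} utf-16={utf16:3}")
```

Latin-script text comes out smaller in UTF-8, while Devanagari and Chinese characters cost three bytes each in UTF-8 against two in UTF-16. In practice markup, digits and whitespace are ASCII, so the difference for real documents is often smaller than for bare text.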

By the way, UTF-7 is not part of the Unicode Standard and was designed primarily for use in mail applications.

╭ゆ眷念 2024-08-12 12:34:07

I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.

First of all, there aren't "variations of Unicode". Unicode is a standard, the standard, for assigning code points (integers) to characters. UTF8 is the most popular way to represent those integers as bytes!
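
To show what "representing those integers as bytes" means, here is a hand-written UTF-8 encoder in Python. It is only a sketch for understanding the bit layout (the function name `utf8_encode` is mine); in real code you would simply call `str.encode("utf-8")`.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes, by hand."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# The hand-rolled encoder agrees with Python's built-in codec.
for ch in "Aé€😀":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```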

Why should you care as a programmer?

  • It's fun to understand this!
  • If you don't have a basic understanding of encodings, you can easily produce buggy code.

Example: You receive a ByteArray myByteArray from somewhere and you know it represents characters. You then run myByteArray.toString() and you get the string Hello. Your program works! One day after shipping your code, your German customer calls: "We have a problem, äöü are not displayed correctly!". You start debugging the code, feeling pretty lost without a basic understanding of encodings. However, with an understanding of encodings you know that the error probably was this: when running myByteArray.toString(), your program assumed the bytes were encoded with the default system encoding. But maybe they weren't! Maybe they were UTF8 and your system default is LATIN-SOMETHING, and so you should have run myByteArray.toString("UTF8") instead!
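
The same bug can be reproduced in a few lines of Python. This is a sketch of the scenario, assuming the bytes were UTF-8 and the wrongly assumed default was Latin-1; the variable names are made up for the illustration.

```python
# Bytes as they might arrive from the German customer, encoded as UTF-8.
payload = "\u00e4\u00f6\u00fc".encode("utf-8")  # the text "äöü"

# Decoding with the wrong assumption silently produces garbage...
wrong = payload.decode("latin-1")
# ...while naming the actual encoding recovers the original text.
right = payload.decode("utf-8")

print(wrong)  # Ã¤Ã¶Ã¼
print(right)  # äöü

# ASCII-only text such as "Hello" survives either way, which is why
# the bug goes unnoticed until non-ASCII characters show up.
assert "Hello".encode("utf-8").decode("latin-1") == "Hello"
```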

Resources:

I would NOT recommend Joel's article as suggested by others. It's a long article with a lot of irrelevant information. I read it a couple of years back and the essence of it didn't stick to my brain since there are so many unimportant details.

As already mentioned, http://wiki.secondlife.com/wiki/Unicode_In_5_Minutes is a great place to go to grasp the essence of Unicode.

If you want to actually understand variable length encodings like UTF8 I'd recommend https://www.tsmean.com/articles/encoding/unicode-and-utf-8-tutorial-for-dummies/.
