What's the difference between an "encoding," a "character set," and a "code page"?

Posted on 2024-09-13 11:43:51

I'm really trying to get better with this stuff. I'm pretty functional with internationalization concepts like this, but I need to get a better background on the theory behind it.

I've read Spolsky's article, but I'm still unclear because these three terms get used interchangeably a lot—even in that article. I think at least two of them are talking about the same thing.

I suspect a high percentage of developers flub their way through this stuff on a daily basis. I don't want to be one of those developers anymore.

Comments (5)

赢得她心 2024-09-20 11:43:51

A ‘character set’ is just what it says: a properly-specified list of distinct characters.

An ‘encoding’ is a mapping between a character set (typically Unicode today) and a (usually byte-based) technical representation of the characters.

UTF-8 is an encoding, but not a character set. It is an encoding of the Unicode character set(*).
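To make the distinction concrete, here is a minimal Ruby sketch (assuming Ruby 2.0 or later) that takes one code point from the Unicode character set and serializes it under three different Unicode encodings. The helper name `encodings_of` is made up for illustration; it is not part of any library.

```ruby
# Serialize a single Unicode code point under several Unicode encodings.
# `encodings_of` is an illustrative helper name, not a library function.
def encodings_of(code_point)
  char = [code_point].pack("U") # build a UTF-8 string from the code point
  %w[UTF-8 UTF-16BE UTF-32BE].each_with_object({}) do |name, result|
    result[name] = char.encode(name).bytes.to_a
  end
end

# U+20AC is the euro sign: one character, three byte representations.
encodings_of(0x20AC).each do |name, bytes|
  puts "#{name}: #{bytes.inspect}"
end
```

The code point (0x20AC) stays the same throughout; only the bytes change from one encoding to the next.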

The confusion comes about because most other well-known encodings (e.g. ISO-8859-1) started out as separate character sets. Then when Unicode came along as a superset of most of these character sets, it became possible to think of them as different (but partial) encodings of the same (Unicode) character set, rather than just isolated character sets. Looking at them this way allows you to convert between them through Unicode easily, which would not be possible if they were merely isolated character sets. But it still makes sense to refer to them as character sets, so either term could be used.
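As a rough illustration of converting "through Unicode", the following Ruby sketch decodes bytes from one legacy encoding into Unicode and then re-encodes them in another. `convert_legacy` is a hypothetical helper written for this example, and the explicit UTF-8 intermediate step is spelled out only to make the idea visible.

```ruby
# Convert bytes from one legacy single-byte encoding to another by
# decoding to Unicode first. `convert_legacy` is a hypothetical helper.
def convert_legacy(bytes, from, to)
  text = bytes.pack("C*").force_encoding(from) # tag the raw bytes with their encoding
  unicode = text.encode("UTF-8")               # step 1: legacy -> Unicode
  unicode.encode(to).bytes.to_a                # step 2: Unicode -> other legacy encoding
end

# 0xA4 is the euro sign in ISO-8859-15; Windows-1252 puts it at 0x80.
p convert_legacy([0xA4], "ISO-8859-15", "Windows-1252") # => [128]
```

Because each legacy encoding covers only part of Unicode, the second step can fail: asking for the euro sign in ISO-8859-1, which has no such character, raises `Encoding::UndefinedConversionError`.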

A ‘code page’ is a term stemming from IBM, where it chose which set of symbols would be displayed. The term continued to be used by DOS and then Windows, through to Unicode-aware Windows where it just acts as an encoding with a numbered identifier. Whilst a numbered ‘code page’ is an idea not inherently limited to Microsoft, today the term would almost always just mean an encoding that Windows knows about.

When one is talking of code page ‹some number› one is typically talking about a Windows-specific encoding, as distinct from an encoding devised by a standards body. For example code page 28591 would not normally be referred to under that name, but simply ‘ISO-8859-1’. The Windows-specific Western European encoding based on ISO-8859-1 (with a few extra characters replacing some of its control codes) would normally be referred to as ‘code page 1252’.
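A small Ruby sketch of that difference, assuming Ruby's encoding names `ISO-8859-1` and `Windows-1252` (the latter is how Ruby names code page 1252). The helper `decode_byte` is invented for this example.

```ruby
# Decode a single byte under a named encoding and return its Unicode code point.
# `decode_byte` is an illustrative helper, not a library function.
def decode_byte(byte, encoding)
  [byte].pack("C").force_encoding(encoding).encode("UTF-8").ord
end

# 0x41 and 0xE9 agree in both encodings; 0x80 is a control code in
# ISO-8859-1 but the euro sign (U+20AC) in code page 1252.
[0x41, 0xE9, 0x80].each do |byte|
  latin1 = decode_byte(byte, "ISO-8859-1")
  cp1252 = decode_byte(byte, "Windows-1252")
  puts format("0x%02X  ISO-8859-1: U+%04X  Windows-1252: U+%04X", byte, latin1, cp1252)
end
```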

[*: All the UTFs are encodings not character sets, but this kind of thing isn't exclusive to Unicode. For example the Japanese standard JIS X 0208 defines a character set and two different byte encodings for it: the somewhat unpleasant high-byte-based encoding (‘Shift-JIS’), and the deeply horrific escape-switching-based encoding (‘JIS’).]
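The same "one character set, several encodings" idea can be sketched in Ruby for the Japanese case. This assumes that the escape-switching encoding the footnote calls 'JIS' corresponds to what Ruby names `ISO-2022-JP`; the helper `jis_encodings` is hypothetical.

```ruby
# Encode one JIS X 0208 character under two different byte encodings.
# Assumption: the 'JIS' encoding in the text is Ruby's "ISO-2022-JP".
def jis_encodings(char)
  {
    "Shift_JIS"   => char.encode("Shift_JIS").bytes.to_a,
    "ISO-2022-JP" => char.encode("ISO-2022-JP").bytes.to_a
  }
end

# U+3042 is HIRAGANA LETTER A.
jis_encodings("\u3042").each do |name, bytes|
  puts "#{name}: #{bytes.map { |b| format('%02X', b) }.join(' ')}"
end
```

Shift_JIS yields two high-bit bytes, while the ISO-2022-JP output starts with an escape sequence (ESC `$` `B`) that switches into the two-byte character set before the character itself appears.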

耳根太软 2024-09-20 11:43:51

A Character Set is just that, a set of characters that can be used.
Each of these characters is mapped to an integer called a code point.
How these code points are represented in memory is the encoding. An encoding is just a method to transform a code-point (U+0041 - Unicode code-point for the character 'A') into raw data (bits and bytes).

悲凉≈ 2024-09-20 11:43:51

A character set is a set of characters, i.e. "glyphs" i.e. visual symbols representing units of communication. The letter a is a glyph and so is € (euro sign). Character sets usually map integers (codepoints) to each character, but it's the encoding that dictates the binary/byte-level representation of the character.

I'm a Ruby programmer, so here are some examples to help you understand the concepts.

This reveals how the Unicode character set maps codepoints to characters, but not how each byte is stored. (Ruby 1.9 defaults to Unicode strings.)

>> 'a'.codepoints.to_a
=> [97]
>> '€'.codepoints.to_a
=> [8364]

Since 8364 (base 10) is too large to fit in one byte, various encoding strategies exist to specify a translation from Unicode codepoints into one or many bytes. The UTF-8 encoding is probably the most popular of these encodings. (Wikipedia shows the UTF-8 encoding algorithm, if you want to delve into the implementation.) Note that the UTF-8 encoding only makes sense in the context of the Unicode character set.

The following reveals how the UTF-8 encoding stores each Unicode character as bytes (0 through 255 in base 10). (Ruby 1.9's default encoding is UTF-8.)

>> 'a'.bytes.to_a
=> [97]
>> '€'.bytes.to_a
=> [226, 130, 172]
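For anyone curious where those three bytes come from, here is a hand-rolled sketch of the UTF-8 bit layout in Ruby. It is illustrative only: `utf8_bytes` is a made-up name, real code should rely on `String#encode`, and this version does not reject surrogate code points.

```ruby
# Hand-rolled UTF-8 encoder for a single code point (illustrative only;
# real code should use String#encode). Surrogates are not rejected here.
def utf8_bytes(cp)
  raise ArgumentError, "code point out of Unicode range" unless (0..0x10FFFF).cover?(cp)

  if cp < 0x80
    [cp]                                        # 1 byte:  0xxxxxxx
  elsif cp < 0x800
    [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]      # 2 bytes: 110xxxxx 10xxxxxx
  elsif cp < 0x10000
    [0xE0 | (cp >> 12),                         # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
     0x80 | ((cp >> 6) & 0x3F),
     0x80 | (cp & 0x3F)]
  else
    [0xF0 | (cp >> 18),                         # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
     0x80 | ((cp >> 12) & 0x3F),
     0x80 | ((cp >> 6) & 0x3F),
     0x80 | (cp & 0x3F)]
  end
end

p utf8_bytes(97)   # => [97]
p utf8_bytes(8364) # => [226, 130, 172]
```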

Here's the same thing in the ISO-8859-15 character set:

>> 'a'.encode('iso-8859-15').codepoints.to_a
=> [97]
>> '€'.encode('iso-8859-15').codepoints.to_a
=> [164]

And the ISO-8859-15 encoding:

>> 'a'.encode('iso-8859-15').bytes.to_a
=> [97]
>> '€'.encode('iso-8859-15').bytes.to_a
=> [164]

Notice that the ISO-8859-15 codepoints match the byte representation.
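One detail worth spelling out: `codepoints` reports numbers in the string's own encoding, which is why the euro sign shows up as 164 rather than 8364 above. A small sketch, with `code_points_in` as a hypothetical helper name:

```ruby
# Return the code point numbers Ruby reports for a character once it is
# converted to `encoding`, alongside the Unicode code points of the same text.
# `code_points_in` is an illustrative helper, not a library function.
def code_points_in(char, encoding)
  converted = char.encode(encoding)
  {
    native:  converted.codepoints.to_a,                # numbering used by `encoding`
    unicode: converted.encode("UTF-8").codepoints.to_a # numbering used by Unicode
  }
end

# The euro sign is 164 in ISO-8859-15 but 8364 (U+20AC) in Unicode.
p code_points_in("\u20AC", "ISO-8859-15")
```

Transcoding back to UTF-8 recovers the Unicode code point, so no information is lost; only the numbering scheme differs.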

Here's a blog entry that might be helpful: What is a Character Encoding?. Entries 1 through 3 are good if you don't want to get too Ruby-specific.

平生欢 2024-09-20 11:43:51

I thought Joel's article was pretty much spot on - it is the history behind the evolution of character sets and storage which has brought this about.

FWIW, in my oversimplistic view

  • Character sets (ASCII, EBCDIC, Unicode) would be the numeric representation of characters, independent of storage considerations.
  • Encodings (ANSI, UTF-7, UTF-8, etc.) would relate to the efficient storage of characters, whether in a file, across the wire, and so on.
  • A code page would be the 'kluge' needed when the demand for new characters (without wanting to increase storage capacity) meant that certain characters were only knowable in the additional context of a code page.

IMHO Wikipedia currently doesn't help things by defining code page as 'another name for character encoding' and redirecting 'character set' to 'character encoding'.

源来凯始玺欢你 2024-09-20 11:43:51

The chapter on Unicode in this book, Advanced Perl Programming, contains the best description of encodings, character sets, and the other entities of Unicode that I've come across. Unfortunately, I don't think it's available for free online.
