What is a multibyte character set?

Posted on 2024-07-18 03:34:48

Does the term "multibyte" refer to a character set whose characters can be, but don't have to be, wider than 1 byte (e.g. UTF-8), or does it refer to character sets whose characters are in every case wider than 1 byte (e.g. UTF-16)? In other words: what does someone mean when they talk about multibyte character sets?

Comments (9)

梦与时光遇 2024-07-25 03:34:48

The term is ambiguous, but in my internationalization work we typically avoided using "multibyte character sets" to refer to Unicode-based encodings. Generally, we used the term only for legacy encoding schemes that use one or more bytes to define each character (excluding encodings that need only one byte per character).

Shift-JIS, JIS, EUC-JP and EUC-KR, along with the Chinese encodings, are typically included.

Most of the legacy encodings, with some exceptions, require a sort of state machine model (or, more simply, a page swapping model) to process, and moving backwards in a text stream is complicated and error-prone. UTF-8 and UTF-16 do not suffer from this problem: a UTF-8 byte can be classified with a bitmask, and a UTF-16 code unit can be tested against the surrogate ranges, so moving backward and forward in a non-pathological document can be done safely without major complexity.
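
To make the bitmask and surrogate-range tests concrete, here is a minimal Python sketch. The helper names utf8_char_start and utf16_unit_start are made up for illustration and are not from any library; the last lines show why the same trick does not carry over to Shift-JIS, where a trail byte can have the same value as an ASCII character.

```python
from typing import Sequence


def utf8_char_start(buf: bytes, i: int) -> int:
    """Return the index of the first byte of the character containing buf[i]."""
    # UTF-8 continuation bytes have the form 10xxxxxx, so mask the top two bits.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


def utf16_unit_start(units: Sequence[int], i: int) -> int:
    """Return the index of the first 16-bit code unit of the character at units[i]."""
    # Low (trailing) surrogates are 0xDC00..0xDFFF; step back onto the high one.
    if 0xDC00 <= units[i] <= 0xDFFF and i > 0 and 0xD800 <= units[i - 1] <= 0xDBFF:
        return i - 1
    return i


text = "aé中😀"
utf8 = text.encode("utf-8")  # 1 + 2 + 3 + 4 = 10 bytes
starts = sorted({utf8_char_start(utf8, i) for i in range(len(utf8))})
print(starts)  # [0, 1, 3, 6]

raw16 = "a😀".encode("utf-16-le")
units = [int.from_bytes(raw16[k:k + 2], "little") for k in range(0, len(raw16), 2)]
print(utf16_unit_start(units, 2))  # 1 (index 2 is the low surrogate of the emoji)

# Contrast with Shift-JIS: the trail byte of "ソ" is 0x5C, the same value as an
# ASCII backslash, so a lone byte cannot be classified without scanning context.
sjis = "ソ".encode("shift_jis")
print(sjis.hex())  # 835c
```

Starting from any byte offset, the UTF-8 helper lands on a character boundary after at most three steps back, which is the property the legacy encodings lack.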

A few legacy encodings, for languages like Thai and Vietnamese, have some of the complexity of multibyte character sets but are really just built on combining characters, and aren't generally lumped in with the broad term "multibyte."

撩人痒 2024-07-25 03:34:48

What is meant if anybody talks about multibyte character sets?

That, as usual, depends on who is doing the talking!

Logically, it should include UTF-8, Shift-JIS, GB etc.: the variable-length encodings. UTF-16 would often not be considered part of this group (even though it kind of is, what with the surrogates; and it is certainly multiple bytes when encoded into bytes via UTF-16LE/UTF-16BE).
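
As a rough illustration of "variable-length", the Python sketch below measures how many bytes one ASCII letter and one CJK character take under a few codecs. The choice of sample characters and of gb2312 as the "GB" representative is mine, not the answer's.

```python
# Bytes per character for an ASCII letter and a CJK character.
codecs_to_check = ("utf-8", "shift_jis", "gb2312", "utf-16-le")
widths = {
    codec: [len(ch.encode(codec)) for ch in "A中"]
    for codec in codecs_to_check
}
for codec, per_char in widths.items():
    print(f"{codec:10} {per_char}")
# utf-8      [1, 3]
# shift_jis  [1, 2]
# gb2312     [1, 2]
# utf-16-le  [2, 2]

# UTF-16 is variable-length too once surrogates are involved:
emoji_width = len("😀".encode("utf-16-le"))
print(emoji_width)  # 4
```

The first three codecs mix 1-byte and multi-byte characters in the same string; UTF-16 only shows its variable length for characters outside the Basic Multilingual Plane.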

But in Microsoftland the term would more typically be used to mean a variable-length default system codepage (for legacy non-Unicode applications, of which there are sadly still plenty). In this usage, UTF-8 and UTF-16LE/UTF-16BE are not included, because historically the system codepage on Windows could not be set to either of these encodings (newer Windows 10 builds add a beta option to use UTF-8).

Indeed, in some cases “mbcs” is no more than a synonym for the system codepage, otherwise known (even more misleadingly) as “ANSI”. In this case a “multibyte” character set could actually be something as trivial as cp1252 Western European, which only uses one byte per character!
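
A quick way to see the difference between a single-byte "ANSI" codepage and a genuinely variable-length one is to compare character counts with byte counts. This Python sketch names cp1252 and cp932 explicitly; Python also ships an "mbcs" codec that maps to the current ANSI codepage, but it only exists on Windows, so it is not used here. The sample strings are arbitrary.

```python
western = "café €"
western_bytes = western.encode("cp1252")
print(len(western), len(western_bytes))  # 6 6 -> one byte per character

japanese = "日本語abc"
japanese_bytes = japanese.encode("cp932")
print(len(japanese), len(japanese_bytes))  # 6 9 -> kanji take 2 bytes, ASCII 1
```

So on a Western European system the "multibyte" codepage never actually produces a multi-byte character, while on a Japanese system it does.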

My advice: use “variable-length” when you mean that, and avoid the ambiguous term “multibyte”; when someone else uses it you'll need to ask for clarification, but typically someone with a Windows background will be talking about a legacy East Asian codepage like cp932 (Shift-JIS) and not a UTF.

﹉夏雨初晴づ 2024-07-25 03:34:48

All character sets where you don't have a 1 byte = 1 character mapping. All Unicode variants, but also Asian character sets, are multibyte.

For more information, I suggest reading this Wikipedia article.

聚集的泪 2024-07-25 03:34:48

A multibyte character is a character whose encoding requires more than 1 byte. This does not imply, however, that all characters in that encoding have the same width in bytes. E.g., UTF-8 and UTF-16 characters vary in width (1 to 4 bytes in UTF-8, 2 or 4 bytes in UTF-16), whereas every UTF-32 character uses exactly 32 bits.
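
The widths are easy to check empirically. This Python sketch encodes four sample characters (my own picks, one per UTF-8 length) in each of the three UTF forms; the -le codec variants are used so that no byte order mark is counted.

```python
forms = ("utf-8", "utf-16-le", "utf-32-le")
table = {ch: tuple(len(ch.encode(f)) for f in forms) for ch in "Aé中😀"}
for ch, row in table.items():
    print(ch, row)
# A (1, 2, 4)
# é (2, 2, 4)
# 中 (3, 2, 4)
# 😀 (4, 4, 4)
```

Only the UTF-32 column is constant, which matches the claim that it is the one fixed-width form.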

白芷 2024-07-25 03:34:48

A multibyte character set may consist of both one-byte and two-byte characters. Thus a multibyte-character string may contain a mixture of single-byte and double-byte characters.

Ref: Single-Byte and Multibyte Character Sets
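
Walking such a mixed string means checking each byte against the codepage's lead-byte ranges. The Python sketch below does this for code page 932 (Shift-JIS), assuming its lead bytes are 0x81-0x9F and 0xE0-0xFC; the function names are hypothetical and the splitter does no validation of trail bytes.

```python
def is_cp932_lead_byte(b: int) -> bool:
    """True if b starts a two-byte character in code page 932 (assumed ranges)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_cp932(data: bytes) -> list:
    """Split cp932 bytes into one bytes object per character, scanning forward."""
    chars = []
    i = 0
    while i < len(data):
        # A lead byte consumes the following trail byte as well.
        step = 2 if is_cp932_lead_byte(data[i]) and i + 1 < len(data) else 1
        chars.append(data[i:i + step])
        i += step
    return chars


mixed = "abc日本語".encode("cp932")
pieces = split_cp932(mixed)
print(len(mixed), len(pieces))  # 9 6
print([p.decode("cp932") for p in pieces])  # ['a', 'b', 'c', '日', '本', '語']
```

Note that the scan has to start from a known character boundary; jumping into the middle of the buffer and testing a single byte is not reliable for this kind of encoding.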

宣告ˉ结束 2024-07-25 03:34:48

UTF-8 is multibyte: each ASCII (English) character is stored in 1 byte, while non-English characters take more, e.g. 3 bytes for most Chinese and Thai characters. When you mix Chinese/Thai with English, as in "ทt", the Thai character "ท" uses 3 bytes while the English character "t" uses only 1 byte. The people who designed multibyte encodings realized that an English character should not be stored in 3 bytes when it fits in 1 byte, since that would waste storage space.
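
The "ทt" example can be checked directly in Python; this small sketch just measures each character's UTF-8 length.

```python
text = "ทt"
per_char = [(ch, len(ch.encode("utf-8"))) for ch in text]
print(per_char)  # [('ท', 3), ('t', 1)]
print(len(text), len(text.encode("utf-8")))  # 2 4 -> two characters, four bytes
```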

UTF-16 stores each character of the Basic Multilingual Plane, English or non-English, in 2 bytes, so it is usually called a wide-character encoding rather than a multibyte one (characters outside that plane, such as most emoji, take a 4-byte surrogate pair). It suits Chinese/Thai text well, since most of those characters fit in 2 bytes, but printing to a UTF-8 console requires converting from wide characters to the multibyte format, for example with the function wcstombs().
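
For readers not working in C, the wide-to-multibyte conversion can be imitated in Python by decoding UTF-16 bytes and re-encoding them as UTF-8. This is only an analogy: wcstombs() works on wchar_t strings and follows the current C locale, whereas the sketch below names both encodings explicitly.

```python
# "Wide" form: UTF-16LE bytes. The Thai letter is one 16-bit unit,
# the emoji is a surrogate pair (two 16-bit units).
wide = "ท😀".encode("utf-16-le")
print(len(wide))  # 6

# "Multibyte" form: the same text as UTF-8.
multibyte = wide.decode("utf-16-le").encode("utf-8")
print(len(multibyte))  # 7 (3 bytes for the Thai letter + 4 for the emoji)
```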

UTF-32 stores each character in a fixed 4 bytes, but it is rarely used for storing text because it wastes storage space.

2024-07-25 03:34:48

Typically the former, i.e. UTF-8-like. For more info, see Variable-width encoding.

夜深人未静 2024-07-25 03:34:48

The former, although the term "variable-length encoding" would be more appropriate.

‘画卷フ 2024-07-25 03:34:48

I generally use it to refer to any character set that can have more than one byte per character.
