Is UTF-8 an injective mapping?

Posted 2024-12-15 10:16:49

We are writing a C++ application and need to know this:

Is the UTF-8 text encoding an injective mapping from bytes to characters, meaning that every single character (letter...) is encoded in only one way? So, e.g., the letter 'Ž' cannot be encoded as, say, both 3231 and 32119.

Comments (5)

小嗷兮 2024-12-22 10:16:49

That depends very much on what you consider a "letter".

UTF-8 is only one small part of Unicode: it is one of several ways to turn code points into bytes.

Basically there are at least three levels: bytes, code points and grapheme clusters.

A code point can be encoded in one or more bytes, according to a certain encoding, like UTF-8, UTF-16 or UTF-32. This encoding is unique (because all alternative ways are declared invalid).

However, a code point is not always a complete letter, because there are so-called combining characters. Such combining characters follow the base character and, as their name says, are combined with the base character. For example, there's the combining character U+0308 COMBINING DIAERESIS, which puts a diaeresis (¨) above the preceding letter. So if it follows e.g. an a (U+0061 LATIN SMALL LETTER A), the result is an ä. However, there's also a single code point for the letter ä (U+00E4 LATIN SMALL LETTER A WITH DIAERESIS), so this means that the code point sequences U+0061 U+0308 and U+00E4 describe the same letter.

So each code point has a single valid UTF-8 encoding (e.g. U+0061 is "\141", U+0308 is "\314\210" and U+00E4 is "\303\244"), but the letter ä is encoded both by the code point sequence U+0061 U+0308, i.e. in UTF-8 the byte sequence "\141\314\210", and by the single code point U+00E4, i.e. the byte sequence "\303\244".
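To make the two spellings concrete, here is a minimal C++ sketch. The helper `count_code_points` and the two constants are names made up for this illustration, and the helper assumes its input is already valid UTF-8. It counts code points by skipping continuation bytes (those of the form 10xxxxxx).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string that is assumed to be
// valid UTF-8, by counting every byte that is not a continuation byte.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte starts a new code point
        }
    }
    return n;
}

// The two spellings of the letter ä, using the octal escapes quoted above.
const std::string decomposed  = "\141\314\210";  // U+0061 U+0308
const std::string precomposed = "\303\244";      // U+00E4
```

Here `decomposed` is three bytes and two code points, `precomposed` is two bytes and one code point, and a plain `decomposed == precomposed` comparison is false even though both render as ä.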

What's worse is that since the Unicode makers decided that combining characters follow the base letter instead of preceding it, you cannot know whether your letter is complete until you've seen the next code point (if it is not a combining code point, your letter is finished).

愁杀 2024-12-22 10:16:49

Valid UTF-8 indeed encodes each character uniquely. However, there are so-called overlong sequences which conform to the general encoding scheme, but are invalid by definition as only the shortest sequence may be used to encode a character.

For example, there's a derivative of UTF-8 called modified UTF-8 which encodes NUL as the overlong sequence 0xC0 0x80 instead of 0x00 to get an encoding compatible with null-terminated strings.

If you're asking about grapheme clusters (i.e. user-perceived characters) instead of characters, then even valid UTF-8 is ambiguous. However, Unicode defines several different normalization forms, and if you restrict yourself to strings in one normalization form, canonically equivalent sequences get a single representation, so UTF-8 is indeed injective in that sense too.

Somewhat off-topic: Here's some ASCII art I came up with to help visualize the different concepts of character. Vertically separated are the human, abstract and machine level. Feel free to come up with better names...

                         [user-perceived characters]<-+
                                      ^               |
                                      |               |
                                      v               |
            [characters] <-> [grapheme clusters]      |
                 ^                    ^               |
                 |                    |               |
                 v                    v               |
[bytes] <-> [codepoints]           [glyphs]<----------+

To get back on topic: This graph also shows where the possible problems may crop up when using bytes to compare abstract strings. In particular (assuming UTF-8), the programmer needs to make sure that

  • the byte sequence is valid, i.e. doesn't contain overlong sequences, encoded surrogate code points (U+D800 to U+DFFF) or values above U+10FFFF
  • the character sequence is normalized so equivalent grapheme clusters have a unique representation
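The first check can be sketched in plain C++. The function name `is_valid_utf8` is invented for this example, and the code is only a sketch of the well-formedness rules, not a tuned validator. It rejects overlong forms, surrogates, values above U+10FFFF, stray continuation bytes and truncated sequences. The second check (normalization) needs Unicode data tables, so in practice it is delegated to a Unicode library and is not shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical validator: returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;     // total bytes in this sequence
        std::uint32_t cp = 0;    // code point being assembled
        std::uint32_t min = 0;   // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i < len) return false;  // sequence is cut off

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min) return false;                      // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        i += len;
    }
    return true;
}
```

With this sketch, `is_valid_utf8("\xC3\xA4")` is true, while the modified UTF-8 NUL `"\xC0\x80"` is rejected as overlong.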

与他有关 2024-12-22 10:16:49

First you need some terminology:

  • Letter: (abstract concept, not in Unicode) some letter or symbol you want to represent.
  • Codepoint: a number associated with a Unicode character.
  • Grapheme cluster: a sequence of Unicode codepoints that corresponds to a single letter, e.g. a + ́ (U+0061 followed by U+0301) for the letter á.
  • Glyph: (concept at the level of fonts, not in Unicode): a graphical representation of a letter.

Each codepoint (e.g: U+1F4A9) gets a unique representation as bytes in UTF-8 (e.g: 0xF0 0x9F 0x92 0xA9).
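The bit arithmetic behind that example can be sketched as follows. `encode_utf8` is a name chosen for this illustration, not a standard library function; returning an empty string for values that have no UTF-8 encoding is just one possible convention.

```cpp
#include <cstdint>
#include <string>

// Hypothetical encoder: turns one Unicode scalar value into UTF-8 bytes.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogates are not encodable
        }
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Each branch covers a disjoint range of code points, which is why a given code point can only come out one way: `encode_utf8(0x1F4A9)` yields the four bytes 0xF0 0x9F 0x92 0xA9.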

Some letters can be represented in several different ways as codepoints (i.e. as different grapheme clusters). E.g. á can be represented as a single codepoint (U+00E1 LATIN SMALL LETTER A WITH ACUTE), or as the codepoint for a (U+0061 LATIN SMALL LETTER A) followed by the codepoint for the accent (U+0301 COMBINING ACUTE ACCENT). Unicode has canonical normalization forms to deal with this: NFC (Normalization Form C) decomposes and then recomposes, so it generally uses fewer codepoints, while NFD (Normalization Form D) is fully decomposed.

And then, there are also ligatures (e.g. ﬁ, U+FB01 LATIN SMALL LIGATURE FI) and some other presentation-related variations of a letter (e.g. superscripts, no-break spaces, letters with different shapes at different places of a word, ...). Some of these are in Unicode to permit lossless round-trip conversion to and from legacy character sets. Unicode has compatibility normalization forms (NFKC and NFKD) to deal with this.

流绪微梦 2024-12-22 10:16:49

Yes. UTF-8 is just a standard way to encode Unicode characters. It was made so that there is only one way to encode each of the Unicode characters.

A bit off-topic: it might be useful to know that some characters are very similar in look (to humans), but they are still different - for instance there is a sign in Cyrillic that looks very similar to '/'.

醉城メ夜风 2024-12-22 10:16:49

Yes, sort of. If used properly, each Unicode code point should only be encoded one way in UTF-8, but that's partly because of the requirement that only the shortest applicable UTF-8 byte sequence be used for any character.

The method used to encode the characters, however, could encode many characters in more than one way if not for this requirement, and though not proper, there are some cases where this is done.

For example, 'Z' could be encoded as 0x5a or {0xc1, 0x9a} (among others), though only 0x5a is considered correct because it is the shortest sequence.
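A small C++ sketch shows why the shortest-form rule matters. Both function names are hypothetical, and the first one deliberately omits the check that a conforming decoder must perform, purely to show what the bit layout alone would accept.

```cpp
#include <cstdint>

// Hypothetical lenient decoder: reads a two-byte sequence by bit layout
// alone (110xxxxx 10xxxxxx), deliberately WITHOUT the shortest-form check.
std::uint32_t decode_two_bytes_lenient(unsigned char b0, unsigned char b1) {
    return (static_cast<std::uint32_t>(b0 & 0x1F) << 6) |
           static_cast<std::uint32_t>(b1 & 0x3F);
}

// A two-byte sequence is overlong when its value would fit in one byte,
// i.e. when it decodes to something below U+0080.
bool is_overlong_two_bytes(unsigned char b0, unsigned char b1) {
    return decode_two_bytes_lenient(b0, b1) < 0x80;
}
```

Here `decode_two_bytes_lenient(0xC1, 0x9A)` gives 0x5A ('Z'), the same value as the single byte 0x5A, and `is_overlong_two_bytes` flags it. A strict decoder rejects such input instead of returning 'Z'.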
