What issues would come from treating UTF-16 as a fixed 16-bit encoding?

Posted on 2024-10-18 12:51:56

I was reading a few questions on SO about Unicode and there were some comments I didn't fully understand, like this one:

Dean Harding: UTF-8 is a
variable-length encoding, which is
more complex to process than a
fixed-length encoding. Also, see my
comments on Gumbo's answer: basically,
combining characters exist in all
encodings (UTF-8, UTF-16 & UTF-32) and
they require special handling. You can
use the same special handling that you
use for combining characters to also
handle surrogate pairs in UTF-16, so
for the most part you can ignore
surrogates and treat UTF-16 just like
a fixed encoding.

I'm a little confused by the last part ("for the most part"). If UTF-16 is treated as a fixed 16-bit encoding, what issues could this cause? What are the chances that there are characters outside of the BMP? If there are, what issues could this cause if you'd assumed two-byte characters?

I read the Wikipedia info on Surrogates but it didn't really make things any clearer to me!

Edit: I guess what I really mean is "Why would anyone suggest treating UTF-16 as fixed encoding when it seems bogus?"

Edit2:

I found another comment in "Is there any reason to prefer UTF-16 over UTF-8?" (https://stackoverflow.com/questions/2934809/is-there-any-reason-to-prefer-utf-16-over-utf-8) which I think explains this a little better:

Andrew Russell: For performance:
UTF-8 is much harder to decode than
UTF-16. In UTF-16 characters are
either a Basic Multilingual Plane
character (2 bytes) or a Surrogate
Pair (4 bytes). UTF-8 characters can
be anywhere between 1 and 4 bytes

This suggests the point being made was that UTF-16 would not have any three-byte characters, so by assuming 16 bits, you wouldn't "totally screw up" by ending up one byte off. But I'm still not convinced this is any different from assuming UTF-8 is single-byte characters!

Comments (5)

妳是的陽光 2024-10-25 12:51:56

UTF-16 encodes every character of the Basic Multilingual Plane (BMP, the "base plane") as a single 16-bit code unit. The BMP covers most current writing systems and includes many older characters that one might realistically encounter. Take a look at what lies outside it and decide whether you are really going to encounter any characters from the supplementary planes: cuneiform, alchemical symbols, etc. Few people will really miss them.

If you do encounter characters from the supplementary planes, each one is encoded as two 16-bit code units (a surrogate pair), and software that does not handle surrogates will show two empty squares or question marks instead of the intended character. UTF-16 is self-synchronizing: surrogate code units occupy their own reserved range (U+D800 to U+DFFF), so half of a surrogate pair never looks like a legitimate character. This allows things like string searches to work even if surrogates are present and you don't handle them.
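To make the surrogate mechanism concrete, here is a minimal Python sketch. The helper names `utf16_units` and `combine_surrogates` are made up for this illustration; the arithmetic is the standard UTF-16 surrogate-pair formula, and the sample character is the same U+10400 that appears as "\uD801\uDC00" elsewhere in this thread.

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units of text in UTF-16 (little-endian, no BOM)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def combine_surrogates(high, low):
    """Rebuild the code point encoded by a high/low surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

bmp_units = utf16_units("\u00e1")         # a BMP character: one code unit
astral_units = utf16_units("\U00010400")  # a supplementary-plane character: two units

print([hex(u) for u in bmp_units])
print([hex(u) for u in astral_units])
print(hex(combine_surrogates(*astral_units)))
```

Both units of the pair fall in the reserved range U+D800 to U+DFFF, which is why a search for an ordinary BMP character cannot accidentally match half of a pair.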

Thus the issues arising from treating UTF-16 as effectively UCS-2 are minimal, aside from the fact that you don't handle the supplementary-plane characters.

EDIT: Unicode uses "combining marks" that render in the space of the previous character, like accents, tildes, circumflexes, etc. Sometimes a combination of a diacritic mark with a letter can also be represented as a distinct code point, e.g. á can be represented as a single \u00e1 instead of a plain 'a' followed by an accent, which is \u0061\u0301. Still, unusual combinations cannot be represented as one code point. This makes search and splitting algorithms a bit more complex. If you somehow make your string data uniform (e.g. by only using plain letters and combining marks), search and splitting become simple again, but you lose the "one position is one character" property anyway. A symmetrical problem arises if you're seriously into typesetting and want to explicitly store ligatures like ﬁ, where one code point corresponds to 2 or 3 characters. This is not a UTF issue; it's an issue of Unicode in general, AFAICT.
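A short Python sketch of the composed versus decomposed forms of á, using the standard `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

composed = "\u00e1"      # á as a single code point
decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT

# The two strings render the same but compare unequal code point by code point.
print(composed == decomposed)
print(len(composed), len(decomposed))

# A naive search for 'a' finds it only in the decomposed form.
print("a" in composed, "a" in decomposed)

# Normalizing both sides to the same form makes comparison reliable again.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```

This is the sense in which making the data uniform simplifies search and splitting: once every string is in the same normalization form, equal text has equal code points, although one position still does not mean one user-perceived character.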

心不设防 2024-10-25 12:51:56

It is important to understand that even UTF-32 is fixed-length only with respect to code points, not characters. Many characters are composed of multiple code points, and therefore you can't really have a Unicode encoding where one number (code unit) corresponds to one character (as perceived by users).

To answer your question: the most obvious issue with treating UTF-16 as a fixed-length encoding form is breaking a string in the middle of a surrogate pair, so that you end up with two invalid halves (unpaired surrogates). It all really depends on what you are doing with the text.
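As a rough illustration of that failure, the sketch below cuts a UTF-16 byte string after two 16-bit units, as code that assumes fixed width might do. The helper `try_decode` is hypothetical and only exists to turn the decoding error into a `None` result.

```python
def try_decode(data):
    """Decode UTF-16-LE bytes strictly; return None if the data is ill-formed."""
    try:
        return data.decode("utf-16-le")
    except UnicodeDecodeError:
        return None

text = "a\U00010400b"            # 'a', one supplementary-plane character, 'b'
raw = text.encode("utf-16-le")   # 2 + 4 + 2 = 8 bytes, i.e. four 16-bit units

# "Take the first two characters", assuming every character is one 16-bit unit.
head, tail = raw[:4], raw[4:]

print(try_decode(raw))    # the intact string decodes fine
print(try_decode(head))   # ends with an unpaired high surrogate
print(try_decode(tail))   # starts with an unpaired low surrogate
```

Whether this matters in practice depends, as the answer says, on what is done with the text: concatenating the two halves again restores a valid string, while storing or transmitting either half on its own does not.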

徒留西风 2024-10-25 12:51:56

I guess what I really mean is
"Why would anyone suggest treating
UTF-16 as fixed encoding when it seems
bogus?"

Two words: Backwards compatibility.

Unicode was originally intended to use a fixed-width 16-bit encoding (UCS-2), which is why early adopters of Unicode (e.g., Sun with Java and Microsoft with Windows NT) used a 16-bit character type. When it turned out that 65,536 characters weren't enough for everyone, UTF-16 was developed to allow these 16-bit character systems to represent the 16 new "planes".

This meant that characters were no longer fixed-width, so people created the rationalization that "that's OK because UTF-16 is almost fixed width."

But I'm still not convinced this is
any different to assuming UTF-8 is
single-byte characters!

Strictly speaking, it's not any different. You'll get incorrect results for things like "\uD801\uDC00".lower().

However, assuming UTF-16 is fixed width is less likely to break than assuming UTF-8 is fixed-width. Non-ASCII characters are very common in languages other than English, but non-BMP characters are very rare.
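The `lower()` case can be reproduced with a small Python sketch. The function `naive_lower_utf16` is a made-up stand-in for code that lowercases UTF-16 one 16-bit unit at a time; it is not how any particular library works, only an illustration of the fixed-width assumption.

```python
import struct

def lower_unit(unit):
    """Lowercase a single 16-bit unit, leaving it unchanged if that is not possible."""
    lowered = chr(unit).lower()
    if len(lowered) == 1 and ord(lowered) <= 0xFFFF:
        return ord(lowered)
    return unit

def naive_lower_utf16(raw):
    """Lowercase UTF-16-LE bytes unit by unit (the fixed-width assumption)."""
    units = struct.unpack("<%dH" % (len(raw) // 2), raw)
    lowered = [lower_unit(u) for u in units]
    return struct.pack("<%dH" % len(lowered), *lowered)

text = "A\U00010400"   # 'A' plus DESERET CAPITAL LETTER LONG I ("\uD801\uDC00")
naive = naive_lower_utf16(text.encode("utf-16-le")).decode("utf-16-le")
correct = text.lower()  # Python strings work on whole code points

print(ascii(naive))
print(ascii(correct))
```

The BMP letter is lowercased either way, but the unit-by-unit version leaves the Deseret letter untouched because neither surrogate has a lowercase mapping on its own. The result is wrong but still well-formed, which is part of why this kind of bug tends to go unnoticed.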

You can use the same special handling
that you use for combining characters
to also handle surrogate pairs in
UTF-16

I don't know what he's talking about. Combining sequences, whose constituent characters have an individual identity, are nothing at all like surrogate characters, which are only meaningful in pairs.

In particular, the characters within a combining sequence can be converted to a different encoding form one character at a time.

>>> 'a'.encode('UTF-8') + '\u0301'.encode('UTF-8')
b'a\xcc\x81'

But not surrogates:

>>> '\uD801'.encode('UTF-8') + '\uDC00'.encode('UTF-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801' in position 0: surrogates not allowed

骄兵必败 2024-10-25 12:51:56

UTF-16 is a variable-length encoding. The older UCS-2 is not. If you treat a variable-length encoding as fixed-length, you risk introducing errors whenever you use "number of 16-bit units" to mean "number of characters", since the number of characters might actually be less than the number of 16-bit units.
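A minimal Python sketch of that mismatch. It relies on Python 3 strings being indexed by code point; `utf16_unit_count` is a hypothetical helper name.

```python
def utf16_unit_count(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp_only = "abc"
with_astral = "a\U00010400b"   # the middle character needs a surrogate pair

print(len(bmp_only), utf16_unit_count(bmp_only))
print(len(with_astral), utf16_unit_count(with_astral))
```

For BMP-only text the two counts agree, which is why the fixed-width assumption appears to work until the first supplementary-plane character arrives.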

情绪失控 2024-10-25 12:51:56

The Unicode standard has changed several times along the way. For example, UCS-2 is not a valid encoding anymore. It has been deprecated for a while now.

As mentioned by user 9000, even in UTF-32 you have sequences of code points that are interdependent. The á is a good example, although this character can be canonicalized to the single code point \u00E1. So you can make it simple.

Even in UTF-32, a single user-perceived character can span many code points: the Stream-Safe Text Format described in the Unicode normalization annex limits a run to 30 consecutive combining (non-starter) code points. (Existing characters do not use that many; I think the longest in existence is currently 17, if I'm correct.)

For that reason, Unicode developed Normalization Forms. It defines four of them; together with unnormalized text, that gives five states a string can be in:

  1. Unnormalized -- a sequence you create manually, for example; text editors are expected to save properly normalized (NFC) code sequences
  2. NFD -- Normalization Form Canonical Decomposition
  3. NFKD -- Normalization Form Compatibility Decomposition
  4. NFC -- Normalization Form Canonical Composition
  5. NFKC -- Normalization Form Compatibility Composition
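The four normalization forms listed above can be compared with Python's standard `unicodedata.normalize`. The sample string is an assumption chosen for illustration: it mixes a compatibility character (the ﬁ ligature) with a decomposed á so that all four forms give different results.

```python
import unicodedata

# LATIN SMALL LIGATURE FI, then 'a' followed by COMBINING ACUTE ACCENT
sample = "\ufb01a\u0301"

forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFD", "NFC", "NFKD", "NFKC")
}

for name, value in forms.items():
    print(name, len(value), ascii(value))
```

The canonical forms (NFD, NFC) keep the ligature as it is and only decompose or compose the accented letter, while the compatibility forms (NFKD, NFKC) also replace the ligature with the two plain letters "fi".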

In most situations, though, this does not matter much, because long compositions are rare, even in languages that use them.

And in most cases, your code already deals with canonical compositions. However, if you create strings manually in your code, you may well create an unnormalized string (assuming you use such long forms).

Properly implemented servers on the Internet are expected to refuse strings that are not canonical compositions as per Unicode. Overlong forms are also forbidden over connections. For example, the UTF-8 bit layout would technically allow an ASCII character to be encoded using 1, 2, 3, or 4 bytes (and the old definition allowed up to 6 bytes!), but only the shortest encoding is permitted.
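A brief Python sketch of the overlong-encoding rule. The byte strings below are hand-built for illustration: they spread the bits of 'A' (U+0041) across two and three bytes, and `is_valid_utf8` is a hypothetical helper around Python's strict decoder.

```python
def is_valid_utf8(data):
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

shortest = b"\x41"               # 'A' in its only permitted form
overlong_2 = b"\xc1\x81"         # the same bits padded out to two bytes
overlong_3 = b"\xe0\x81\x81"     # the same bits padded out to three bytes

print(is_valid_utf8(shortest))
print(is_valid_utf8(overlong_2))
print(is_valid_utf8(overlong_3))
```

Rejecting the padded forms means each code point has exactly one valid byte sequence, so byte-level comparisons and filters cannot be bypassed by an alternative spelling of the same character.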

Any comment on the Internet that contradicts the Unicode Normalization Form document is simply incorrect.
