Can a Unicode ligature character have more than one representation in UTF-8?
Can the Unicode ligature character fi (U+FB01) have more than one representation in UTF-8? If so, which ones? And what about each normalization form?
The character should be encoded as 0xEF 0xAC 0x81 in UTF-8, but the same character can be decomposed into an f and an i in sequence, which together are 0x66 0x69. Your question is actually answered directly by this chart from the Unicode specification: as you can see, the NFD and NFC normalizations still use the same code point for the ligature, while the NFKD and NFKC forms use the f + i combination.
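The behavior of the four normalization forms can be checked with Python's standard unicodedata module. This is a minimal sketch; the helper name normalized_forms is made up for illustration and is not part of any library.

```python
import unicodedata

LIGATURE = "\ufb01"  # LATIN SMALL LIGATURE FI


def normalized_forms(text):
    """Map each normalization form to (normalized string, UTF-8 bytes)."""
    result = {}
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        normalized = unicodedata.normalize(form, text)
        result[form] = (normalized, normalized.encode("utf-8"))
    return result


forms = normalized_forms(LIGATURE)
for form, (normalized, encoded) in forms.items():
    code_points = " ".join(f"U+{ord(c):04X}" for c in normalized)
    hex_bytes = " ".join(f"0x{b:02X}" for b in encoded)
    print(f"{form}: {code_points} -> {hex_bytes}")
```

NFC and NFD leave U+FB01 untouched, so the bytes stay 0xEF 0xAC 0x81; NFKC and NFKD replace it with the two letters, giving 0x66 0x69.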
This depends on the meaning of “character,” which is rather obscure. In Unicode, “character” usually means a code point assigned to a character, and this does not match exactly the intuitive concept of “character.”
A single codepoint, such as U+FB01, has only one representation in UTF-8, because UTF-8 defines an unambiguous algorithm for generating the encoded form.
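To see why the encoding is unambiguous, here is a small hand-written sketch of the UTF-8 encoding rule for a single code point. The function name utf8_encode is hypothetical; in practice you would simply call str.encode("utf-8"), which the last line uses as a cross-check.

```python
def utf8_encode(code_point):
    """Encode one Unicode scalar value as UTF-8 (illustrative re-implementation)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:  # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:  # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


ligature_bytes = utf8_encode(0xFB01)
print(" ".join(f"0x{b:02X}" for b in ligature_bytes))
# Cross-check against the built-in encoder.
matches_builtin = ligature_bytes == "\ufb01".encode("utf-8")
```

Each code point falls into exactly one of the four ranges, so there is exactly one valid byte sequence for it.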
An intuitive character, such as the fi ligature, may be represented either as a single code point or as a sequence of code points, each of which has its own UTF-8 representation. Unicode normalization rules define, in part, mappings between such alternatives.
But the compatibility mapping for U+FB01 (to U+0066 U+0069, i.e. “f” followed by “i”) does not preserve the identity of an intuitive character: the ligature is mapped to two normal letters.
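The compatibility mapping itself can be read from the Unicode Character Database via unicodedata.decomposition. The sketch below (the helper compatibility_mapping is a hypothetical name) also shows that the mapping is one-way: normalizing the two letters never yields the ligature again.

```python
import unicodedata


def compatibility_mapping(ch):
    """Return (tag, code points) from the character's decomposition field."""
    fields = unicodedata.decomposition(ch).split()
    # A leading "<...>" field marks a compatibility (not canonical) mapping.
    tag = fields[0] if fields and fields[0].startswith("<") else None
    code_points = [int(f, 16) for f in fields if not f.startswith("<")]
    return tag, code_points


tag, code_points = compatibility_mapping("\ufb01")
print(tag, [f"U+{cp:04X}" for cp in code_points])

# The mapping loses the ligature's identity: "fi" stays two letters.
round_trip = unicodedata.normalize("NFKC", "fi")
```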
On the other hand, you can ask for, or suggest, ligature behavior by inserting U+200D ZERO WIDTH JOINER (ZWJ) between two letters, like “f” and “i”. In a sense, the sequence U+0066 U+200D U+0069 is an alternative representation of the fi ligature, but this is not a formal property of the character, and whether ZWJ has any effect depends on the rendering software.
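As a quick illustration that the ZWJ sequence is a separate encoding with no formal link to U+FB01, the following sketch prints its UTF-8 bytes and checks that none of the four normalization forms changes it. Whether a ligature is actually displayed cannot be tested this way, since that is up to the font and renderer.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER
zwj_sequence = "f" + ZWJ + "i"

zwj_bytes = zwj_sequence.encode("utf-8")
print(" ".join(f"0x{b:02X}" for b in zwj_bytes))

# No normalization form rewrites the sequence, let alone maps it to U+FB01.
unchanged = {
    form: unicodedata.normalize(form, zwj_sequence) == zwj_sequence
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
```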