Can a Unicode ligature character have more than one representation in UTF-8?

Posted on 2025-01-07 19:36:36

Can the Unicode ligature character fi (U+FB01) have more than one representation in UTF-8? If so, which ones, and how does this differ for each normalization form?

Comments (2)

吃→可爱长大的 2025-01-14 19:36:36

The character is encoded in UTF-8 as 0xEF 0xAC 0x81. Under compatibility decomposition, however, the same character can be decomposed into an f followed by an i, which together are encoded as 0x66 0x69. Your question is actually answered directly by this chart from the Unicode specification:

[Chart: normalized forms of ligatures]

As you can see, the NFD and NFC normalizations keep the single codepoint U+FB01 for the ligature, while the NFKD and NFKC forms replace it with the two-letter sequence f + i.
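The byte values above can be checked with a short script. This is only a minimal sketch using Python's standard `unicodedata` module; the helper names `utf8_hex` and `normalized_bytes` are made up for illustration and are not part of the original answer.

```python
import unicodedata

LIGATURE = "\ufb01"  # LATIN SMALL LIGATURE FI

def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex, e.g. 'EF AC 81'."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))

def normalized_bytes(text: str) -> dict:
    """Map each normalization form name to the UTF-8 hex of the normalized text."""
    forms = ("NFD", "NFC", "NFKD", "NFKC")
    return {form: utf8_hex(unicodedata.normalize(form, text)) for form in forms}

for form, hex_bytes in normalized_bytes(LIGATURE).items():
    print(f"{form:4} -> {hex_bytes}")
# NFD  -> EF AC 81
# NFC  -> EF AC 81
# NFKD -> 66 69
# NFKC -> 66 69
```

So the answer to "which one" depends on whether the text has gone through a compatibility normalization (NFKD or NFKC) before being encoded.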

七堇年 2025-01-14 19:36:36

This depends on the meaning of “character,” which is rather obscure. In Unicode, “character” usually means a codepoint assigned to a character, and this does not match exactly the intuitive concept of “character.”

A single codepoint, such as U+FB01, has only one representation in UTF-8, because UTF-8 defines an unambiguous algorithm for generating the encoded form.
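To illustrate why there is only one valid encoding, here is a hand-written sketch of the UTF-8 encoding rules in Python. The functions `encode_utf8` and `is_valid_utf8` are hypothetical helpers written for this example (in practice you would just call `str.encode`). The last check feeds the decoder an "overlong" four-byte spelling of U+FB01, which a conforming decoder is expected to reject.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point using the shortest UTF-8 form (the only valid one)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                       # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(encode_utf8(0xFB01).hex(" "))              # ef ac 81
print(encode_utf8(0xFB01) == "\ufb01".encode())  # True
# Overlong 4-byte spelling of the same bits: not a second representation.
print(is_valid_utf8(b"\xf0\x8f\xac\x81"))        # False
```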

An intuitive character, such as the fi ligature, may have different representations as a codepoint or as a sequence of codepoints, each of which has its own UTF-8 representation. Unicode normalization rules define, in part, mappings between such alternatives.

But the compatibility mapping for U+FB01 (to U+0066 U+0069, i.e. “f” followed by “i”) does not preserve the identity of an intuitive character: the ligature is mapped to two normal letters.
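A rough way to see the difference is to compare the ligature with a character that has a canonical decomposition. The sketch below uses Python's `unicodedata`; the choice of U+00E9 as the contrast case and the helper names `is_compatibility_mapping` and `survives_round_trip` are my own additions for illustration.

```python
import unicodedata

LIGATURE = "\ufb01"  # LATIN SMALL LIGATURE FI
E_ACUTE = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, used here only for contrast

def is_compatibility_mapping(ch: str) -> bool:
    """True if the decomposition mapping carries a <tag>, i.e. it is not canonical."""
    return unicodedata.decomposition(ch).startswith("<")

def survives_round_trip(ch: str, decompose: str, recompose: str) -> bool:
    """Decompose with one form, recompose with another, and compare to the input."""
    decomposed = unicodedata.normalize(decompose, ch)
    return unicodedata.normalize(recompose, decomposed) == ch

print(unicodedata.decomposition(LIGATURE))  # <compat> 0066 0069
print(unicodedata.decomposition(E_ACUTE))   # 0065 0301

# Canonical equivalence is reversible: NFD then NFC gives back U+00E9.
print(survives_round_trip(E_ACUTE, "NFD", "NFC"))     # True
# The compatibility mapping is one-way: "fi" is never recomposed into U+FB01.
print(survives_round_trip(LIGATURE, "NFKD", "NFKC"))  # False
```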

On the other hand, you can ask for, or suggest, ligature behavior by inserting U+200D ZERO WIDTH JOINER (ZWJ) between two letters, such as “f” and “i”. In a sense, the sequence U+0066 U+200D U+0069 is an alternative representation of the fi ligature, but this is not a formal property of the characters, and whether the ZWJ has any effect depends on the rendering software.
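For completeness, here is what the ZWJ sequence looks like at the byte level, again as a small Python sketch with made-up names (`ZWJ_SEQUENCE`, `utf8_hex`). It also checks that none of the four normalization forms turns the sequence into U+FB01, which is consistent with the point that the relationship is a rendering hint rather than a formal equivalence.

```python
import unicodedata

ZWJ = "\u200d"                  # ZERO WIDTH JOINER
LIGATURE = "\ufb01"             # LATIN SMALL LIGATURE FI
ZWJ_SEQUENCE = "f" + ZWJ + "i"  # U+0066 U+200D U+0069

def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))

print(unicodedata.name(ZWJ))   # ZERO WIDTH JOINER
print(utf8_hex(ZWJ_SEQUENCE))  # 66 E2 80 8D 69

# Normalization leaves the sequence alone; it never becomes the ligature codepoint.
for form in ("NFD", "NFC", "NFKD", "NFKC"):
    normalized = unicodedata.normalize(form, ZWJ_SEQUENCE)
    print(form, normalized == ZWJ_SEQUENCE, normalized == LIGATURE)
```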
