Is there an encoding in Unicode where every "character" is just one code point?

Posted 2024-10-10 16:32:47

Trying to rephrase: Can you map every combining character combination into one code point?

I'm new to Unicode, but it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct?

Is this true for Basic Multilingual Plane also?

Comments (3)

幸福%小乖 2024-10-17 16:32:47

If you mean one char == one number (i.e., every character is represented by the same number of bytes/words/what-have-you): in UCS-4 (UTF-32), each code point is represented by a single 4-byte number. That's far more than enough for every code point to fit in a single value, but it's quite wasteful if you don't need any of the higher characters.

If you mean combining sequences (i.e., where e + ´ => é): there are single-code-point (precomposed) representations for most of the combinations in use in existing modern languages. If you're making up your own language, you could run into problems... but if you're sticking to the ones that people actually use, you'll usually be fine.
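As a rough illustration of the two readings above, here is a minimal Python sketch. The helper name `utf32_units` is made up for this example. It counts 4-byte UTF-32 units, which shows that the fixed width applies per code point, not per user-perceived character.

```python
def utf32_units(text: str) -> int:
    """Number of 4-byte units needed to encode text as UTF-32 (no BOM)."""
    return len(text.encode("utf-32-be")) // 4

precomposed = "\u00e9"   # é as one precomposed code point
decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT
astral = "\U0001F600"    # a code point outside the BMP

print(utf32_units(precomposed))  # 1
print(utf32_units(decomposed))   # 2
print(utf32_units(astral))       # 1
```

So UTF-32 gives one fixed-size value per code point, but a decomposed é still takes two of them.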

青朷 2024-10-17 16:32:47

Can you map every combining character combination into one code point?

Every combining character combination? How would your proposed encoding represent the string "à̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚ͅ͏͓͔͕͖͙͚͐͑͒͗͛ͣͤͥͦͧͨͩͪͫͬͭͮͯ͘͜͟͢͝͞͠͡"? (an 'a' with more than a hundred combining marks attached to it?) It's just not practical.

There are, however, a lot of "precomposed" characters in Unicode, like áçñü. Normalization form C will use these instead of the decomposed version whenever possible.
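A small sketch of what Normalization Form C does with these characters, using Python's standard `unicodedata` module (the variable names are only for illustration):

```python
import unicodedata

# "áçñü" written as base letters followed by combining marks
decomposed = "a\u0301c\u0327n\u0303u\u0308"

# NFC replaces each base + mark pair with its precomposed code point
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))  # 8 code points
print(len(composed))    # 4 code points
print(composed == "\u00e1\u00e7\u00f1\u00fc")  # True

# NFD goes the other way and restores the decomposed sequence
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```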

拧巴小姐 2024-10-17 16:32:47

it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct?

Depends on the meaning of the word “character.” Unicode has the concepts of abstract character (definition 7 in chapter 3 of the standard: “A unit of information used for the organization, control, or representation of textual data”) and encoded character (definition 11: “An association (or mapping) between an abstract character and a code point”). So a character is never a code point, but for many code points there is an abstract character that maps to the code point; this mapping is called an “encoded character.” But (definition 11, paragraph 4): “A single abstract character may also be represented by a sequence of code points.”

Is this true for Basic Multilingual Plane also?

There is no conceptual difference related to abstract or encoded characters between the BMP and the other planes. The statement above holds for all subsets of the codespace.
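To make the BMP point concrete, here is a small sketch with one example chosen for illustration: q followed by U+0303 COMBINING TILDE. Both code points are in the BMP, and as far as I know Unicode has no precomposed code point for this pair, so even NFC leaves it as two code points.

```python
import unicodedata

q_tilde = "q\u0303"  # q + U+0303 COMBINING TILDE

# Both code points lie inside the Basic Multilingual Plane
print(all(ord(ch) <= 0xFFFF for ch in q_tilde))  # True

# NFC cannot merge them because there is no precomposed form to use
nfc = unicodedata.normalize("NFC", q_tilde)
print(len(nfc))  # 2
```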

Depending on your application, you have to distinguish between the terms glyph, grapheme cluster, grapheme, abstract character, encoded character, code point, scalar value, code unit and byte. All of these concepts are different, and there is no simple mapping between them. In particular, there is almost never a one-to-one mapping between these entities.
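A minimal sketch of how some of these units diverge for the same text. The `counts` helper is hypothetical. Grapheme clusters are left out because Python's standard library has no segmentation API for them.

```python
def counts(text: str) -> dict:
    """Count the same text in three different units."""
    return {
        "code_points": len(text),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }

# é written as e + U+0301 COMBINING ACUTE ACCENT
print(counts("e\u0301"))
# {'code_points': 2, 'utf16_code_units': 2, 'utf8_bytes': 3}

# One code point outside the BMP
print(counts("\U0001F600"))
# {'code_points': 1, 'utf16_code_units': 2, 'utf8_bytes': 4}
```

In both cases a reader would see one "character" on screen, yet none of the three counts agree with that in general.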
