Does the Unicode Consortium intend to make UTF-16 run out of characters?

Posted 2025-01-08 03:32:39

UTF-16 can currently encode only 1,112,064 distinct code points: the range 0x0-0x10FFFF (1,114,112 values) minus the 2,048 surrogates.

Does the Unicode Consortium intend to make UTF-16 run out of characters, i.e. to define a code point above 0x10FFFF?

If not, why would anyone write a UTF-8 parser that accepts 5- or 6-byte sequences? Doing so only adds unnecessary instructions to the function.

Isn't 1,112,064 enough? Do we actually need more characters? In other words, how quickly are we running out?

Comments (4)

Posted 2025-01-15 03:32:39

As of 2011, 109,449 characters have been allocated and a further 137,468 code points (6,400 + 131,068) are set aside for application (private) use, leaving room for over 860,000 unused code points. That is plenty for CJK Extension E (~10,000 characters) and 85 more sets just like it, so in the event of contact with the Ferengi culture, we should be ready.
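The arithmetic behind those figures can be checked with a short sketch. The constant names are made up for illustration, and the counts are the 2011 numbers quoted above, not current Unicode statistics:

```python
# Figures quoted in this answer (Unicode as of 2011); names are illustrative.
TOTAL_VALUES = 0x10FFFF + 1        # every value in 0x0..0x10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1   # reserved for UTF-16 surrogate pairs
ALLOCATED = 109_449                # characters allocated as of 2011
PRIVATE_USE = 6_400 + 131_068      # set aside for application use

encodable = TOTAL_VALUES - SURROGATES
unused = encodable - ALLOCATED - PRIVATE_USE
cjk_e_sized_sets = unused // 10_000  # how many ~10,000-character sets still fit

print(encodable)         # 1112064
print(unused)            # 865147
print(cjk_e_sized_sets)  # 86
```

The result of 86 matches "CJK Extension E and 85 more sets just like it".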

In November 2003 the IETF restricted UTF-8 to end at U+10FFFF with RFC 3629, in order to match the constraints of the UTF-16 character encoding. A UTF-8 parser should therefore reject 5- and 6-byte sequences, which would overflow the UTF-16 range, as well as 4-byte sequences that encode a value greater than 0x10FFFF.
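As a rough illustration of what that restriction means for a parser, here is a minimal sketch of a strict decoder. The function name `decode_utf8_strict` is hypothetical, and this is one way to enforce the RFC 3629 limits, not a reference implementation:

```python
from typing import List


def decode_utf8_strict(data: bytes) -> List[int]:
    """Decode UTF-8 bytes into code points, rejecting anything past U+10FFFF."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length, cp, minimum = 1, lead, 0x0
        elif 0xC2 <= lead <= 0xDF:
            length, cp, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            length, cp, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF4:
            length, cp, minimum = 4, lead & 0x07, 0x10000
        else:
            # 0x80..0xC1: stray continuation byte or overlong 2-byte lead.
            # 0xF5..0xFF: would exceed U+10FFFF or start a 5/6-byte form.
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        for byte in data[i + 1 : i + length]:
            if byte & 0xC0 != 0x80:
                raise ValueError(f"bad continuation byte in sequence at offset {i}")
            cp = (cp << 6) | (byte & 0x3F)
        if cp < minimum:
            raise ValueError(f"overlong encoding at offset {i}")
        if cp > 0x10FFFF:
            # Reachable only via lead byte 0xF4 followed by 0x90 or higher.
            raise ValueError(f"code point above U+10FFFF at offset {i}")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate code point at offset {i}")
        code_points.append(cp)
        i += length
    return code_points


print([hex(cp) for cp in decode_utf8_strict(b"\xF4\x8F\xBF\xBF")])  # ['0x10ffff']
```

In this sketch the 5- and 6-byte forms cost no extra branches: their lead bytes (0xF8 and up) fall into the same `else` arm as every other invalid lead byte.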

Please edit in here any character sets that threaten the Unicode code point limit, provided they are more than 1/3 the size of CJK Extension E (~10,000 characters):

彩虹直至黑白 2025-01-15 03:32:39

At present, the Unicode standard doesn't define any characters above U+10FFFF, so it is fine to code your app to reject characters above that point.

Predicting the future is hard, but I think you're safe for the near term with this strategy. Honestly, even if Unicode extends past U+10FFFF in the distant future, it almost certainly won't be for mission-critical glyphs. Your app might not be compatible with the new Ferengi fonts that come out in 2063, but you can always fix it when it actually becomes an issue.

遗忘曾经 2025-01-15 03:32:39

Cutting to the chase:

It is indeed intentional that the encoding system only supports code points up to U+10FFFF.

There does not appear to be any real risk of running out any time soon.

怎会甘心 2025-01-15 03:32:39

There is no reason to write a UTF-8 parser that supports 5- or 6-byte sequences, except to support legacy systems that actually used them. The current official UTF-8 specification does not allow 5- or 6-byte sequences, in order to accommodate 100% lossless conversion to and from UTF-16. If Unicode ever has to support new code points above U+10FFFF, there will be plenty of time to devise new encoding formats for the higher bit counts. Or maybe by the time that happens, memory and computational power will be sufficient that everyone will just switch to UTF-32 for everything, whose 32-bit code units have room for values up to 0xFFFFFFFF, over 4 billion.
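To see where the U+10FFFF ceiling comes from on the UTF-16 side, here is a minimal sketch of surrogate-pair encoding. The helper names `to_utf16_units` and `from_utf16_pair` are made up for illustration:

```python
from typing import List


def to_utf16_units(cp: int) -> List[int]:
    """Encode one code point as one or two 16-bit UTF-16 code units."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not encodable in UTF-16: {cp:#x}")
    if cp < 0x10000:
        return [cp]                      # BMP: a single code unit
    offset = cp - 0x10000                # fits in 20 bits
    high = 0xD800 | (offset >> 10)       # top 10 bits    -> 0xD800..0xDBFF
    low = 0xDC00 | (offset & 0x3FF)      # bottom 10 bits -> 0xDC00..0xDFFF
    return [high, low]


def from_utf16_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate back into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print([hex(u) for u in to_utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
print(hex(from_utf16_pair(0xDBFF, 0xDFFF)))       # 0x10ffff
```

The largest possible pair, 0xDBFF 0xDFFF, decodes to exactly 0x10FFFF. A pair carries only 20 bits on top of the 0x10000 base, so UTF-16 cannot represent anything higher, and a UTF-8 stream limited to the same range always round-trips.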
