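Does the Unicode Consortium intend to let UTF-16 run out of characters?
The current version of UTF-16 can only encode 1,112,064 distinct code points (the range 0x0-0x10FFFF, excluding the 2,048 surrogate values).
Does the Unicode Consortium intend to let UTF-16 run out of characters, i.e. will it ever create a code point > 0x10FFFF?
If not, why would anyone write a UTF-8 parser that accepts 5- or 6-byte sequences? Doing so only adds unnecessary instructions to the function.
Isn't 1,112,064 enough? Do we really need more characters? I mean: how soon would we run out?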
As of 2011, 109,449 characters have been assigned, and a further 137,468 code points (6,400 + 131,068) are set aside for application (private) use,
leaving room for over 860,000 unused characters. That is plenty for CJK Extension E (~10,000 characters) and 85 more sets just like it, so in the event of contact with the Ferengi culture, we should be ready.
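The headroom figure can be checked with a few lines of arithmetic. This is only a sketch using the counts quoted above; the variable names are made up for illustration, and it ignores smaller reserved ranges such as noncharacters.

```python
# Code points U+0000..U+10FFFF, minus the 2,048 surrogates (U+D800..U+DFFF)
# that UTF-16 reserves for pairing and that never stand for characters.
total_code_points = 0x10FFFF + 1           # 1,114,112
surrogates = 0xDFFF - 0xD800 + 1           # 2,048
encodable = total_code_points - surrogates

assigned_2011 = 109_449                    # figure quoted in the answer
private_use = 6_400 + 131_068              # figures quoted in the answer

unused = encodable - assigned_2011 - private_use
# How many blocks of roughly 10,000 characters (the approximate size
# given for CJK Extension E) would still fit.
cjk_e_sized_blocks = unused // 10_000

print(encodable, unused, cjk_e_sized_blocks)
```

Under these assumptions the unused space works out to a little over 860,000 code points, which matches the estimate in the answer.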
In November 2003 the IETF restricted UTF-8 to end at U+10FFFF with RFC 3629, in order to match the constraints of the UTF-16 character encoding. A UTF-8 parser should therefore not accept 5- or 6-byte sequences, which would overflow the UTF-16 range, nor 4-byte sequences that encode values greater than 0x10FFFF.
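To make the restriction concrete, here is a minimal sketch of a single-code-point decoder that follows the limits described above. The function name `decode_one` and its error messages are hypothetical; it is meant as an illustration, not a reference implementation of RFC 3629.

```python
from typing import Tuple

MAX_CODE_POINT = 0x10FFFF


def decode_one(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one code point starting at data[pos].

    Returns (code_point, next_pos); raises ValueError on invalid input.
    """
    first = data[pos]
    if first < 0x80:                       # 1 byte: plain ASCII
        return first, pos + 1
    if 0xC2 <= first <= 0xDF:              # 2-byte lead
        length, cp, minimum = 2, first & 0x1F, 0x80
    elif 0xE0 <= first <= 0xEF:            # 3-byte lead
        length, cp, minimum = 3, first & 0x0F, 0x800
    elif 0xF0 <= first <= 0xF4:            # 4-byte lead
        length, cp, minimum = 4, first & 0x07, 0x10000
    else:
        # 0x80..0xC1: stray continuation byte or overlong 2-byte lead.
        # 0xF5..0xFF: would exceed U+10FFFF, or is an old 5/6-byte lead.
        raise ValueError("invalid lead byte 0x%02X" % first)

    tail = data[pos + 1:pos + length]
    if len(tail) != length - 1:
        raise ValueError("truncated sequence")
    for byte in tail:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte 0x%02X" % byte)
        cp = (cp << 6) | (byte & 0x3F)

    if cp < minimum:
        raise ValueError("overlong encoding")
    if cp > MAX_CODE_POINT:
        raise ValueError("code point above U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code point")
    return cp, pos + length
```

Rejecting the lead bytes 0xF5 and above costs one range comparison, so dropping the 5- and 6-byte forms makes the parser simpler rather than more complex.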
Please edit in a list of any character sets that threaten the Unicode code point limit here, if they are over 1/3 the size of CJK Extension E (~10,000 characters):
At present, the Unicode standard doesn't define any characters above U+10FFFF, so you would be fine to code your app to reject characters above that point.
Predicting the future is hard, but I think you're safe for the near term with this strategy. Honestly, even if Unicode extends past U+10FFFF in the distant future, it almost certainly won't be for mission-critical glyphs. Your app might not be compatible with the new Ferengi fonts that come out in 2063, but you can always fix it when it actually becomes an issue.
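In many environments you may not need to write the rejection logic yourself, because the runtime's strict decoder already enforces the limit. As an illustration, the sketch below leans on Python's built-in UTF-8 codec; the helper name `rejects` is hypothetical.

```python
def rejects(raw: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder refuses the bytes."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# U+10FFFF, the highest code point, is accepted.
print(rejects(b"\xf4\x8f\xbf\xbf"))
# A 4-byte sequence that would encode 0x110000 is refused.
print(rejects(b"\xf4\x90\x80\x80"))
# Old-style 5-byte and 6-byte sequences are refused as well.
print(rejects(b"\xf8\x88\x80\x80\x80"))
print(rejects(b"\xfc\x84\x80\x80\x80\x80"))
```

The same limit applies when going from numbers to characters: `chr(0x10FFFF)` works, while `chr(0x110000)` raises `ValueError`.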
Cutting to the chase:
It is indeed intentional that the encoding system only supports code points up to U+10FFFF.
There does not appear to be any real risk of running out any time soon.
There is no reason to write a UTF-8 parser that supports 5-6 byte sequences, except to support any legacy systems that actually used them. The current official UTF-8 specification does not allow 5-6 byte sequences, in order to accommodate 100% lossless conversion to and from UTF-16. If Unicode ever has to support new code points above U+10FFFF, there will be plenty of time to devise new encoding formats for the higher bit counts. Or maybe by the time that happens, memory and computational power will be plentiful enough that everyone will just switch to a 32-bit encoding for everything. UTF-32 as currently defined is also capped at U+10FFFF, but a 32-bit code unit could in principle hold values up to 0xFFFFFFFF, for over 4 billion characters.
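The U+10FFFF ceiling falls directly out of how UTF-16 surrogate pairs work: each half of a pair carries 10 bits, so a pair can address 2^20 code points beyond the first 0x10000. A small sketch of that arithmetic follows; the function names are made up for illustration.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 (high, low) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not representable as a surrogate pair")
    offset = cp - 0x10000                  # fits in 20 bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible pair yields the largest code point UTF-16 can express.
largest = from_surrogates(0xDBFF, 0xDFFF)
print(hex(largest))

# Round trip for one code point above U+FFFF.
pair = to_surrogates(0x1F600)
print([hex(unit) for unit in pair], hex(from_surrogates(*pair)))
```

Any code point above 0x10FFFF has no surrogate pair at all, which is why a UTF-8 decoder that accepted larger values could not convert its output to UTF-16 without loss.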