Is there a language that requires three or more bytes per character when encoded in UTF-8? Which ones?
Commonly used ones, of course; Klingon doesn't count :-)
Thanks, guys, let me run the willItFit() test cases.
OK, now I've figured out that saving bytes with UTF-8 causes more problems than it solves. Thanks again.
Comments (4)
Every character from U+0800 onward requires at least 3 bytes, so that's a HUGE number of potential characters. This includes Asian scripts such as Japanese, Chinese, Korean, and Thai.
For a complete list of script ranges, you can refer to Unicode's block data. Only the blocks below U+0800 can be represented with 1 or 2 bytes; characters from all other blocks require 3 or 4 bytes.
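To make the U+0800 boundary concrete, here is a minimal Python sketch that encodes one sample character per script and reports its UTF-8 length. The helper name `utf8_len` and the choice of sample characters are mine, purely for illustration.

```python
def utf8_len(ch: str) -> int:
    """Return how many bytes the text `ch` occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Sample characters picked for illustration only (one per script).
samples = {
    "Latin capital A (U+0041)": "\u0041",
    "Cyrillic capital Zhe (U+0416)": "\u0416",
    "Hebrew Alef (U+05D0)": "\u05d0",
    "Thai Ko Kai (U+0E01)": "\u0e01",
    "Hiragana A (U+3042)": "\u3042",
    "CJK ideograph (U+4E2D)": "\u4e2d",
    "Hangul syllable (U+D55C)": "\ud55c",
    "Emoji outside the BMP (U+1F600)": "\U0001F600",
}

for name, ch in samples.items():
    print(f"{name}: {utf8_len(ch)} byte(s)")
```

Everything below U+0800 (Latin, Cyrillic, Hebrew here) comes out at 1 or 2 bytes, the Thai, Japanese, Chinese, and Korean samples at 3, and the character outside the Basic Multilingual Plane at 4.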
Here we go:
More details:
http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes , Basic Multilingual Plane, code points from 0x0800.
Some examples: Indic scripts, Thai, Philippine scripts, Hiragana, Katakana. So all East Asian scripts and some others.
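The three-byte boundary at U+0800 follows from how many payload bits each UTF-8 sequence length can carry. The sketch below derives the byte count from those bit widths and cross-checks it against Python's own encoder at the boundary code points. The function name `utf8_byte_count` and the `PAYLOAD_BITS` table are illustrative names of my own, not something from the linked article.

```python
# Payload bits carried by a UTF-8 sequence of each length:
#   1 byte : 0xxxxxxx                            ->  7 bits
#   2 bytes: 110xxxxx 10xxxxxx                   -> 11 bits
#   3 bytes: 1110xxxx 10xxxxxx 10xxxxxx          -> 16 bits
#   4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx -> 21 bits
PAYLOAD_BITS = {1: 7, 2: 11, 3: 16, 4: 21}
MAX_CODE_POINT = 0x10FFFF  # Unicode ends here, below the 21-bit limit


def utf8_byte_count(code_point: int) -> int:
    """Return the UTF-8 length of a code point, derived from payload bits."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    for length, bits in PAYLOAD_BITS.items():
        if code_point < (1 << bits):
            return length
    raise AssertionError("unreachable: every code point fits in 21 bits")


# Cross-check against the real encoder on both sides of each boundary.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_byte_count(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_byte_count(cp)} byte(s)")
```

Two bytes carry only 11 bits, so the last two-byte code point is U+07FF, and U+0800 is the first one that needs three.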
You need three bytes even for plain English. For example, the typographically correct apostrophe (U+2019) is encoded in UTF-8 as 0xE2 0x80 0x99, opening quotation marks (U+201C) are 0xE2 0x80 0x9C, and closing quotation marks (U+201D) are 0xE2 0x80 0x9D. The ellipsis (U+2026) is 0xE2 0x80 0xA6. And that's not even counting all the different dashes, spaces, or the inch and feet signs.
“It’s kinda hard to write English without the apostrophe’s help …”
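The byte sequences quoted above are easy to check yourself. Here is a small Python sketch that prints the UTF-8 bytes for each of those punctuation marks; the helper name `hex_bytes` is just an illustrative choice.

```python
def hex_bytes(text: str) -> str:
    """Format the UTF-8 encoding of `text` as space-separated 0xNN values."""
    return " ".join(f"0x{b:02X}" for b in text.encode("utf-8"))


# Typographic punctuation that shows up in ordinary English text.
punctuation = {
    "right single quotation mark / apostrophe (U+2019)": "\u2019",
    "left double quotation mark (U+201C)": "\u201c",
    "right double quotation mark (U+201D)": "\u201d",
    "horizontal ellipsis (U+2026)": "\u2026",
}

for name, ch in punctuation.items():
    print(f"{name}: {hex_bytes(ch)}")
```

All four sit in the General Punctuation block (U+2000 to U+206F), well above U+0800, which is why each takes three bytes.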
Many Asian languages are represented with more than 2 bytes per character. Strictly speaking they don't have to be (legacy encodings such as Shift_JIS and EUC-KR use 2 bytes for most characters), but in UTF-8, Japanese and Korean text generally takes 3 bytes per character.
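For a rough sense of the difference, this Python sketch compares the encoded size of a short Japanese and a short Korean word in UTF-8, UTF-16, and a common legacy encoding. The sample words and the helper name `encoded_sizes` are my own picks for illustration; the codec names are the ones Python's standard library ships.

```python
def encoded_sizes(text: str, encodings: list[str]) -> dict[str, int]:
    """Map each encoding name to the byte length of `text` in that encoding."""
    return {enc: len(text.encode(enc)) for enc in encodings}


japanese = "\u65e5\u672c\u8a9e"  # "Japanese language", 3 characters
korean = "\ud55c\uad6d\uc5b4"    # "Korean language", 3 characters

# utf-16-le is used so that no byte order mark is counted.
print(encoded_sizes(japanese, ["utf-8", "utf-16-le", "shift_jis"]))
print(encoded_sizes(korean, ["utf-8", "utf-16-le", "euc_kr"]))
```

For these samples UTF-8 needs 9 bytes per word, while UTF-16 and the legacy encodings need 6, so UTF-8 is not the compact choice for text that is mostly CJK.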