Are there any languages that need three or more bytes per character when encoded in UTF-8? Which ones?

Posted 2024-09-18 07:30:38

Commonly used ones, of course; Klingon doesn't count :-)

Thanks, guys, let me run the willItFit() test cases.

OK, now I've figured out that saving bytes with UTF-8 causes more problems than it solves. Thanks again.

Comments (4)

温暖的光 2024-09-25 07:30:38

Characters requiring 3 bytes start at U+0800 and run through U+FFFF, the end of the Basic Multilingual Plane; code points from U+10000 upward need 4 bytes. That's a HUGE number of potential characters. It includes Asian scripts such as Japanese, Chinese, Korean, and Thai.

For a complete list of script ranges, you can refer to Unicode's block data. Only the blocks below can be represented with 1 or 2 bytes (1 byte for Basic Latin only, 2 bytes for the rest); characters from all other blocks require 3 or 4 bytes:

0000..007F Basic Latin
0080..00FF Latin-1 Supplement
0100..017F Latin Extended-A
0180..024F Latin Extended-B
0250..02AF IPA Extensions
02B0..02FF Spacing Modifier Letters
0300..036F Combining Diacritical Marks
0370..03FF Greek and Coptic
0400..04FF Cyrillic
0500..052F Cyrillic Supplement
0530..058F Armenian
0590..05FF Hebrew
0600..06FF Arabic
0700..074F Syriac
0750..077F Arabic Supplement
0780..07BF Thaana
07C0..07FF NKo
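
The ranges above can be turned into a small check. This is a minimal sketch in Python; the function name `utf8_length` is made up for illustration, and the sample code points are arbitrary picks from either side of each boundary.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a single Unicode code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:      # Basic Latin (US-ASCII)
        return 1
    if code_point <= 0x7FF:     # Latin-1 Supplement through NKo
        return 2
    if code_point <= 0xFFFF:    # rest of the Basic Multilingual Plane
        return 3
    return 4                    # supplementary planes


# Arbitrary sample code points around each boundary (illustrative only).
samples = [0x41, 0xE9, 0x42F, 0x7FF, 0x800, 0x4E2D, 0x1F600]

for cp in samples:
    # Cross-check the range rule against Python's own encoder.
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X} -> {utf8_length(cp)} byte(s)")
```

The last 2-byte code point is U+07FF and the first 3-byte one is U+0800, which is why the block list stops at NKo.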
疯狂的代价 2024-09-25 07:30:38

Here we go:

So the first 128 characters (US-ASCII)
need one byte. The next 1,920
characters need two bytes to encode.
This includes Latin letters with
diacritics and characters from Greek,
Cyrillic, Coptic, Armenian, Hebrew,
Arabic, Syriac and Tāna alphabets.
Three bytes are needed for the rest of
the Basic Multilingual Plane (which
contains virtually all characters in
common use). Four bytes are needed for
characters in the other planes of
Unicode, which include less common CJK
characters and various historic
scripts.

More details:

http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes , Basic Multilingual Plane, code points from 0x0800.

Some examples: Indic scripts, Thai, Philippine scripts, Hiragana, Katakana. So all East Asian scripts and some others.
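
To check the examples listed above, here is a rough Python sketch that encodes one letter from each script and counts the bytes. The specific letters are my own picks for illustration, not taken from the answer; a Greek letter and a CJK Extension B ideograph are included for contrast.

```python
# One sample letter per script; the choice of letters is illustrative.
samples = {
    "Greek alpha": "\u03b1",
    "Devanagari ka": "\u0915",
    "Thai ko kai": "\u0e01",
    "Tagalog a": "\u1700",
    "Hiragana a": "\u3042",
    "Katakana a": "\u30a2",
    "CJK Extension B ideograph": "\U00020000",
}

# Map each sample name to its UTF-8 length in bytes.
byte_counts = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, count in byte_counts.items():
    print(f"{name}: {count} byte(s)")
```

The Greek letter comes out at 2 bytes, the Indic, Thai, Philippine and kana letters at 3, and the Extension B ideograph (outside the Basic Multilingual Plane) at 4.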

噩梦成真你也成魔 2024-09-25 07:30:38

You even need three bytes just for English. For example, the typographically correct apostrophe (U+2019) is encoded in UTF-8 as 0xE2 0x80 0x99, the opening quotation mark (U+201C) is 0xE2 0x80 0x9C and the closing quotation mark (U+201D) is 0xE2 0x80 0x9D. The ellipsis (U+2026) is 0xE2 0x80 0xA6. And that's not even talking about all the different dashes, spaces or the inch and feet signs.

“It’s kinda hard to write English without the apostrophe’s help …”
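
The byte sequences quoted above are easy to confirm. A minimal Python sketch follows; the variable names are mine, and the sentence is the example line above written with escape sequences.

```python
# Typographic punctuation mentioned in the answer.
marks = {
    "apostrophe (right single quotation mark)": "\u2019",
    "opening double quotation mark": "\u201c",
    "closing double quotation mark": "\u201d",
    "horizontal ellipsis": "\u2026",
}

# Encode each mark and keep the raw UTF-8 bytes.
encoded = {name: ch.encode("utf-8") for name, ch in marks.items()}

for name, raw in encoded.items():
    print(name, " ".join(f"0x{b:02X}" for b in raw))

# The example sentence, with its five non-ASCII punctuation marks.
sentence = (
    "\u201cIt\u2019s kinda hard to write English "
    "without the apostrophe\u2019s help \u2026\u201d"
)
extra_bytes = len(sentence.encode("utf-8")) - len(sentence)
print("extra bytes over one-byte-per-character:", extra_bytes)
```

Each of the five marks in the sentence costs two bytes more than an ASCII character would, so the sentence is 10 bytes longer than its character count.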

人生百味 2024-09-25 07:30:38

There are representations of many Asian languages that use more than 2 bytes. While it's true that they probably don't specifically need to, Japanese and Korean (at least) are often represented in multi-byte form.
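
One possible reading of this answer, and it is only my assumption, is that legacy encodings such as Shift_JIS and EUC-KR store these characters in 2 bytes while UTF-8 uses 3. The Python sketch below compares the two using the standard library codecs `shift_jis` and `euc_kr`; the sample words are arbitrary.

```python
# Arbitrary sample words: "Japanese language" and "Korean language".
japanese = "日本語"
korean = "한국어"

# Byte counts for each word in UTF-8 and in a common legacy encoding.
sizes = {
    "Japanese, UTF-8": len(japanese.encode("utf-8")),
    "Japanese, Shift_JIS": len(japanese.encode("shift_jis")),
    "Korean, UTF-8": len(korean.encode("utf-8")),
    "Korean, EUC-KR": len(korean.encode("euc_kr")),
}

for label, size in sizes.items():
    print(f"{label}: {size} bytes for 3 characters")
```

For these three-character words UTF-8 needs 9 bytes and the legacy encodings need 6, which is the byte saving the question seems to be weighing.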
