How many bytes do we need to store an Arabic character?

Posted 2024-10-04 23:39:19

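I am a bit confused about the storage needed to represent an Arabic character.

Please tell me whether this is true:

In ISO/IEC 8859-6 encoding it requires 2 bytes (http://en.wikipedia.org/wiki/ISO/IEC_8859-6)
In UNICODE it requires 4 bytes (http://en.wikipedia.org/wiki/Arabic_Unicode)

What are the advantages of each encoding? When should we choose one over the other?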

偏闹i 2024-10-11 23:39:19

Well first, Unicode is not an encoding. It is a standard for assigning code points to every character in every language. These code points are integers; how many bytes they take up depends on the specific encoding. The most common Unicode encodings are UTF-8 and UTF-16.

To summarise:

ISO 8859-6 uses 1 byte for each Arabic character, but supports neither "Arabic presentation forms" nor characters from any script other than ASCII.
UTF-8 uses 2 bytes for each Arabic character, and 3 bytes for "Arabic presentation forms".
UTF-16 uses 2 bytes for each Arabic character, including "Arabic presentation forms".

I will use two examples: 'ح' (U+062D) and 'ﻰ' (U+FEF0). Those numbers are hexadecimal codes representing the Unicode code point of each of those characters.

In ISO 8859-6, most Arabic characters take up just a single byte, since that encoding is dedicated to Arabic. For example, the character 'ح' (U+062D) is encoded as the single byte "CD", as you can see from the table on the Wikipedia article. The character 'ﻰ' (U+FEF0) is listed as an "Arabic Presentation Form", so I suppose that explains why it doesn't appear in ISO 8859-6 at all (you can't encode this character in that encoding).
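The single-byte claim can be checked with Python's built-in codecs. This is a minimal sketch, assuming the standard `iso8859_6` codec that ships with CPython; the variable names are only illustrative, and the characters are written as escapes to avoid right-to-left rendering problems.

```python
# Illustrative names; the two code points are the answer's own examples.
hah = "\u062d"            # U+062D, a basic Arabic letter
presentation = "\ufef0"   # U+FEF0, from the Arabic Presentation Forms

# A basic Arabic letter fits in one byte in ISO 8859-6.
encoded = hah.encode("iso8859_6")
print(encoded.hex(), len(encoded))

# A presentation form has no slot in ISO 8859-6, so encoding fails.
try:
    presentation.encode("iso8859_6")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print("U+FEF0 encodable in ISO 8859-6:", encodable)
```

Passing `errors="replace"` to `encode` would substitute `?` instead of raising, which silently loses the character, so the strict default is usually the safer choice.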

There are two very common Unicode encodings which let you encode all characters: UTF-8 and UTF-16. They have slightly different uses. UTF-8 uses 1 byte for ASCII characters, 2 or 3 bytes for the other characters in the Basic Multilingual Plane (including all of Arabic), and 4 bytes for everything else. UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane, and 4 bytes for everything else. So if your text is mostly ASCII, UTF-8 is more compact. For text dominated by characters in the U+0800 to U+FFFF range (Chinese or Japanese, for example), UTF-16 is more compact. For basic Arabic letters the two are the same size.
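To make the size trade-off concrete, here is a rough comparison of encoded lengths. The sample strings and the `byte_counts` helper are hypothetical, chosen only to cover one case per size class; `utf-16-le` is used so that no byte order mark inflates the UTF-16 count.

```python
# Hypothetical sample strings, one per size class discussed above.
samples = {
    "ascii": "hello",                              # 5 ASCII characters
    "arabic": "\u0645\u0631\u062d\u0628\u0627",    # 5 basic Arabic letters
    "cjk": "\u4f60\u597d",                         # 2 CJK ideographs
    "non_bmp": "\U0001f600",                       # 1 character outside the BMP
}

def byte_counts(text: str) -> tuple[int, int]:
    """Return (UTF-8 length, UTF-16 length) in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for name, text in samples.items():
    utf8_len, utf16_len = byte_counts(text)
    print(f"{name}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Real Arabic text usually contains ASCII spaces, digits, and punctuation as well, which cost 1 byte in UTF-8 and 2 in UTF-16, so mixed text tends to come out somewhat smaller in UTF-8.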

In UTF-8, 'ح' (U+062D) is encoded as the 2-byte sequence "D8 AD", while 'ﻰ' (U+FEF0) is encoded as the 3-byte sequence "EF BB B0". Basically, characters from U+0080 to U+07FF use 2 bytes, and characters from U+0800 to U+FFFF use 3 bytes. So all the basic Arabic and Arabic supplement characters use 2 bytes, whereas the Arabic Presentation Forms use 3 bytes.
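The byte values can be derived by hand from the UTF-8 bit layout: the 2-byte form covers code points up to U+07FF and the 3-byte form starts at U+0800. The sketch below uses a hypothetical helper, `utf8_encode_bmp`, that handles only code points up to U+FFFF and does not reject surrogates; in practice `str.encode("utf-8")` does all of this for you.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode one code point in U+0000..U+FFFF as UTF-8 (illustrative only)."""
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code points above U+FFFF need the 4-byte form")

print(utf8_encode_bmp(0x062D).hex(" "))  # basic Arabic letter
print(utf8_encode_bmp(0xFEF0).hex(" "))  # Arabic presentation form
```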

In UTF-16, 'ح' (U+062D) is encoded as the 2-byte sequence "2D 06", while 'ﻰ' (U+FEF0) is encoded as the 2-byte sequence "F0 FE". Both Arabic letters and Arabic presentation forms take two bytes in UTF-16. The sequences above are in little-endian order, which is why the bytes are just the code point with its two halves swapped around. Endianness complicates matters, because big-endian UTF-16 is equally valid: it gives "06 2D" for the first character and "FE F0" for the second. A reader therefore needs a byte order mark, or some other agreement, to know which order is in use.
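The two byte orders can be seen side by side in Python. This is a small sketch using the standard `utf-16-le`, `utf-16-be`, and `utf-16` codecs; the variable names are illustrative. The plain `utf-16` codec prepends a byte order mark and uses the platform's native order when encoding, so its exact bytes can differ between machines.

```python
import codecs

hah = "\u062d"  # U+062D, the first example character from the answer

little = hah.encode("utf-16-le")   # low byte first
big = hah.encode("utf-16-be")      # high byte first
with_bom = hah.encode("utf-16")    # BOM followed by native byte order

print("little-endian:", little.hex(" "))
print("big-endian:   ", big.hex(" "))
print("with BOM:     ", with_bom.hex(" "))

# The BOM tells a decoder which order follows, so round-tripping works
# regardless of the platform that produced the bytes.
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print("starts with a BOM:", has_bom)
print("round trip ok:", with_bom.decode("utf-16") == hah)
```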

In summary, I would usually recommend UTF-8, as it is unambiguous (it has no byte-order issue) and supports ASCII text very well. Arabic characters are 2 bytes in both UTF-8 and UTF-16 (unless you use "presentation forms", which take 3 bytes in UTF-8). You can use ISO 8859-6 if you are only using ASCII and Arabic characters, and nothing else, and that will save you some space, but it usually isn't worth it, as it will break as soon as some other characters come along. UTF-8 and UTF-16 support all characters in Unicode.
