How do I get the decimal value of a Unicode character in C#?

Posted on 2024-12-11 00:37:55

How do I get the numeric value of a Unicode character in C#?

For example, if the Tamil character அ (U+0B85) is given, the output should be 2949 (i.e. 0x0B85).

Multi code-point characters

Some characters require multiple code points. In the following examples, encoded as UTF-16, each code unit is still in the Basic Multilingual Plane:

  • a single glyph composed of U+0072 U+0327 U+030C
  • a single glyph composed of U+0072 U+0338 U+0327 U+0316 U+0317 U+0300 U+0301 U+0302 U+0308 U+0360

The larger point is that one "character" can require more than one UTF-16 code unit; it can require more than two, or more than three.

One "character" can require dozens of Unicode code points. In UTF-16 in C# that means more than one char; a single character can require 17 char.

My question is about converting a char into its UTF-16 code unit value. Even if an entire string of 17 char represents only one "character", I still want to know how to convert each UTF-16 unit into a numeric value.

e.g.

String s = "அ";

int i = Unicode(s[0]);

Where Unicode returns the integer value, as defined by the Unicode standard, for the first character of the input expression.


Comments (5)

↙温凉少女 2024-12-18 00:37:55

It's basically the same as Java. If you've got it as a char, you can just convert to int implicitly:

char c = '\u0b85';

// Implicit conversion: char is basically a 16-bit unsigned integer
int x = c;
Console.WriteLine(x); // Prints 2949

If you've got it as part of a string, just get that single character first:

string text = GetText();
int x = text[2]; // Or whatever...

Note that characters outside the Basic Multilingual Plane are represented as two UTF-16 code units (a surrogate pair). .NET can give you the full Unicode code point, for example via char.ConvertToUtf32, but you have to account for the surrogate pairs yourself.
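To make the remark about full code points concrete, here is a minimal sketch. char.ConvertToUtf32(string, int) and char.IsHighSurrogate are existing .NET methods; the class and method names (CodePoints, Enumerate) are made up for illustration and are not part of the answer above.

```csharp
using System;
using System.Collections.Generic;

public static class CodePoints
{
    // Hypothetical helper: returns one int per Unicode code point,
    // combining surrogate pairs into a single value.
    // char.ConvertToUtf32 throws ArgumentException on an unpaired surrogate.
    public static List<int> Enumerate(string text)
    {
        var result = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            result.Add(char.ConvertToUtf32(text, i));
            // A high surrogate means this code point used two chars; skip the low one.
            if (char.IsHighSurrogate(text[i]))
            {
                i++;
            }
        }
        return result;
    }

    public static void Main()
    {
        // One BMP character (1 char) followed by one supplementary character (2 chars).
        foreach (int cp in Enumerate("அ\U00013000"))
        {
            Console.WriteLine("U+{0:X4} = {0}", cp);
        }
    }
}
```

For characters in the Basic Multilingual Plane this gives the same number as the plain char-to-int conversion, so it should be usable as a drop-in when supplementary characters might appear.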

围归者 2024-12-18 00:37:55
((int)'அ').ToString()

If you have the character as a char, you can cast that to an int, which will represent the character's numeric value. You can then print that out in any way you like, just like with any other integer.

If you wanted hexadecimal output instead, you can use:

((int)'அ').ToString("X4")

X is for hexadecimal, 4 is for zero-padding to four characters.
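A small sketch of the same formatting idea, plus the reverse direction. The helper names (ToUPlus, FromHex) are hypothetical; ToString("X4", ...) and int.Parse with NumberStyles.HexNumber are standard .NET calls. CultureInfo.InvariantCulture is passed only to keep the output independent of the machine's locale.

```csharp
using System;
using System.Globalization;

public static class CharFormatting
{
    // Hypothetical helper: formats one UTF-16 code unit in U+XXXX notation.
    public static string ToUPlus(char c)
    {
        return "U+" + ((int)c).ToString("X4", CultureInfo.InvariantCulture);
    }

    // Hypothetical helper: parses hex digits such as "0B85" back into a char.
    // Assumes the value fits in 16 bits.
    public static char FromHex(string hex)
    {
        return (char)int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        char c = 'அ';
        Console.WriteLine(((int)c).ToString(CultureInfo.InvariantCulture)); // 2949
        Console.WriteLine(ToUPlus(c));                                      // U+0B85
        Console.WriteLine(FromHex("0B85") == c);                            // True
    }
}
```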

述情 2024-12-18 00:37:55

How do I get the numeric value of a Unicode character in C#?

A char is not necessarily the whole Unicode code point. In UTF-16 encoded languages such as C#, you may actually need 2 chars to represent a single "logical" character. And your string lengths might not be what you expect - the MSDN documentation for the String.Length property says:

"The Length property returns the number of Char objects in this instance, not the number of Unicode characters."

  • So, if your Unicode character is encoded in just one char, it is already numeric (essentially an unsigned 16-bit integer). You may want to cast it to some of the integer types, but this won't change the actual bits that were originally present in the char.
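  • If your Unicode character is 2 chars (a surrogate pair), you'll need to multiply one by 2^16 and add it to the other, resulting in a uint numeric value. Note that this value packs the two UTF-16 code units side by side; it is not the Unicode code point, which char.ConvertToUtf32(c1, c2) returns: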

    char c1 = ...;
    char c2 = ...;
    uint c = ((uint)c1 << 16) | c2;
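The shift-and-OR above yields the two code units packed into 32 bits. If the Unicode code point is what you are after, the surrogate arithmetic looks roughly like the sketch below. The class and method names (SurrogateMath, Combine) are invented for illustration; char.IsHighSurrogate, char.IsLowSurrogate and char.ConvertToUtf32(char, char) are real .NET methods, and the last one is shown only as a cross-check.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: combines a high/low surrogate pair into a code point by hand.
    public static int Combine(char high, char low)
    {
        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
        {
            throw new ArgumentException("Expected a high surrogate followed by a low surrogate.");
        }
        // Each surrogate carries 10 payload bits; together they encode (code point - 0x10000).
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void Main()
    {
        string s = "\U00013000";                  // U+13000, stored as two chars
        int manual = Combine(s[0], s[1]);
        int builtIn = char.ConvertToUtf32(s[0], s[1]);
        uint packed = ((uint)s[0] << 16) | s[1];  // the packed code units, for comparison

        Console.WriteLine(manual);                // 77824
        Console.WriteLine(builtIn);               // 77824
        Console.WriteLine(packed.ToString("X8")); // D80CDC00
    }
}
```

The packed value and the code point differ (0xD80CDC00 versus 0x13000), so which one you want depends on whether you care about the encoding or about the character.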

How do I get the decimal value of a Unicode character in C#?

When you say "decimal", this usually means a character string containing only characters that a human being would interpret as decimal digits.

  • If you can represent your Unicode character by only one char, you can convert it to a decimal string simply by:

    char c = 'அ';
    string s = ((ushort)c).ToString();

  • If you have 2 chars for your Unicode character, convert them to a uint as described above, then call uint.ToString.

--- EDIT ---

AFAIK diacritical marks are considered separate "characters" (and separate code points) despite being visually rendered together with the "base" character. Each of these code points taken alone is still at most 2 UTF-16 code units.

BTW I think the proper name for what you are talking about is not "character" but "combining character sequence" (a base character plus its combining marks). So yes, a single combining character sequence can have more than 1 code point and therefore more than 2 code units. If you want a decimal representation of such a sequence, you can probably do it most easily through BigInteger:

string c = "\x0072\x0338\x0327\x0316\x0317\x0300\x0301\x0302\x0308\x0360";
string s = (new BigInteger(Encoding.Unicode.GetBytes(c))).ToString();

Depending on which order of significance you want for the code unit "digits", you may want to reverse c first.
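If the goal is one number per UTF-16 unit rather than one huge number per sequence, a possible sketch is to split the string into text elements first. StringInfo.GetTextElementEnumerator is an existing .NET API in System.Globalization; the class and method names below (TextElements, CodeUnitsPerElement) are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElements
{
    // Hypothetical helper: splits a string into text elements (a base character
    // plus its combining marks) and returns the UTF-16 code unit values of each.
    public static List<int[]> CodeUnitsPerElement(string text)
    {
        var result = new List<int[]>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(text);
        while (e.MoveNext())
        {
            string element = e.GetTextElement();
            int[] units = new int[element.Length];
            for (int i = 0; i < element.Length; i++)
            {
                units[i] = element[i]; // implicit char-to-int conversion
            }
            result.Add(units);
        }
        return result;
    }

    public static void Main()
    {
        // "r" followed by two combining marks: one visible character, three chars.
        string text = "\u0072\u0327\u030C";
        foreach (int[] units in CodeUnitsPerElement(text))
        {
            Console.WriteLine(string.Join(" ", units)); // 114 807 780
        }
    }
}
```

Exactly where text element boundaries fall can vary between .NET versions for more complex scripts, so treat the grouping as a best effort; the per-unit numeric values do not depend on it.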

不及他 2024-12-18 00:37:55
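char c = 'அ';
short code = (short)c;    // 2949; chars above 0x7FFF wrap to a negative short (or throw in a checked context)
ushort code2 = (ushort)c; // 2949; ushort covers the whole char range 0..0xFFFF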
微凉 2024-12-18 00:37:55

This is an example of using Plane 1, the Supplementary Multilingual Plane (SMP):

string single_character = "\U00013000"; // U+13000, the first Egyptian hieroglyph
// in UTF-16 it is encoded as 4 bytes (two chars) instead of 2

// get the Unicode code point using UTF-32 (fixed 4 bytes per code point)
Encoding enc = new UTF32Encoding(false, true, true); // little-endian, BOM flag, throw on invalid input
byte[] b = enc.GetBytes(single_character);
Int32 code = BitConverter.ToInt32(b, 0); // 77824 in decimal, on little-endian hardware
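The same UTF-32 idea can be stretched to strings holding more than one code point. The sketch below is an assumption-laden variant, not the answer's own code: it uses Encoding.UTF32 (little-endian) and assembles each value by hand so the result does not depend on the machine's byte order. The names Utf32Values and Decode are hypothetical.

```csharp
using System;
using System.Text;

public static class Utf32Values
{
    // Hypothetical helper: returns every code point in the string by encoding it
    // as little-endian UTF-32 and reading the bytes four at a time.
    // Encoding.UTF32 substitutes U+FFFD for unpaired surrogates instead of throwing.
    public static int[] Decode(string text)
    {
        byte[] bytes = Encoding.UTF32.GetBytes(text); // GetBytes does not emit a BOM
        int[] codePoints = new int[bytes.Length / 4];
        for (int i = 0; i < codePoints.Length; i++)
        {
            int offset = i * 4;
            // Least significant byte first, independent of BitConverter.IsLittleEndian.
            codePoints[i] = bytes[offset]
                | (bytes[offset + 1] << 8)
                | (bytes[offset + 2] << 16)
                | (bytes[offset + 3] << 24);
        }
        return codePoints;
    }

    public static void Main()
    {
        foreach (int cp in Decode("அ\U00013000"))
        {
            Console.WriteLine(cp); // 2949, then 77824
        }
    }
}
```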