C++: getting the ASCII value of a wide character

Posted 2024-08-24 08:36:52

Let's say I have a char array like "äa".
Is there a way to get the ASCII value (e.g. 228) of the first character, which is multibyte?
Even if I cast my array to a wchar_t * array, I'm not able to get the ASCII value of "ä", because it's 2 bytes long.
Is there a way to do this? I've been trying for 2 days now :(

I'm using gcc.

Thanks!

Comments (5)

も让我眼熟你 2024-08-31 08:36:52

You're contradicting yourself. International characters like ä are (by definition) not in the ASCII character set, so they don't have an "ascii value".

Whether you can get the code point for a single character, and if so in which format, depends on the exact encoding of your two-character array.
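One way to find out which encoding the array actually holds is to look at its raw bytes. The following is a minimal sketch, not part of the original answer; the helper name `hex_bytes` is made up for illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Returns the bytes of a NUL-terminated char array as space-separated hex.
// Hypothetical helper, for inspection only.
std::string hex_bytes(const char* s) {
    std::string out;
    const std::size_t n = std::strlen(s);
    for (std::size_t i = 0; i < n; ++i) {
        char buf[4];
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

If "äa" comes out as `E4 61`, the array is in a single-byte encoding such as Latin 1; if it comes out as `C3 A4 61`, it is UTF-8.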

三生池水覆流年 2024-08-31 08:36:52

You are very confused. ASCII only has values smaller than 128. The value 228 corresponds to ä in the 8-bit character sets ISO-8859-1, CP1252 and some others. It is also the code point of ä in Unicode (U+00E4). If you use the string literal "ä" and get a string of two chars, the string is in fact encoded in UTF-8, and you may wish to decode the UTF-8 to obtain the Unicode code points.

More likely, what you really want to do is convert from one character set to another. How to do this depends heavily on your operating system, so more information is required. You also need to specify exactly what you want: a std::string or char* in ISO-8859-1, perhaps?
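If an ISO-8859-1 std::string is indeed what is wanted, and the input is known to be UTF-8, the conversion can be sketched by hand without any OS-specific library. This is an illustration under those assumptions, not the answerer's method; `utf8_to_latin1` is a hypothetical name.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Converts UTF-8 to ISO-8859-1. ISO-8859-1 only covers U+0000..U+00FF,
// so only 1-byte sequences and 2-byte sequences led by 0xC2 or 0xC3
// can be accepted; anything else throws.
std::string utf8_to_latin1(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {
            out += static_cast<char>(b0);  // ASCII passes through unchanged
        } else if ((b0 == 0xC2 || b0 == 0xC3) && i + 1 < in.size()) {
            const unsigned char b1 = static_cast<unsigned char>(in[++i]);
            if ((b1 & 0xC0) != 0x80)
                throw std::runtime_error("bad continuation byte");
            // 110xxxxx 10yyyyyy -> xxxxxyyyyyy
            out += static_cast<char>(((b0 & 0x1F) << 6) | (b1 & 0x3F));
        } else {
            throw std::runtime_error("malformed or not representable");
        }
    }
    return out;
}
```

For the UTF-8 bytes C3 A4 61 this yields the two bytes E4 61, so the first byte read as unsigned char is 228.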

Saygoodbye 2024-08-31 08:36:52

There is a standard C++ function to do that conversion: ctype::narrow(), a member of the std::ctype facet template. It is part of the localization library. It will convert the wide character to the equivalent char value for your current locale, if possible. As the other answers have pointed out, there isn't always a mapping, which is why ctype::narrow() takes a default character that it will return if there is no mapping.
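A call to the facet might look roughly like the sketch below. The wrapper name `narrow_or` and its default locale argument are assumptions made for illustration; only `std::use_facet` and `std::ctype<wchar_t>::narrow` come from the standard library.

```cpp
#include <locale>

// Narrows one wide character using the ctype facet of the given locale.
// `fallback` is returned when the locale has no single-char equivalent.
char narrow_or(wchar_t wc, char fallback,
               const std::locale& loc = std::locale()) {
    return std::use_facet<std::ctype<wchar_t>>(loc).narrow(wc, fallback);
}
```

Whether L'ä' narrows to the byte 0xE4 depends on the locale passed in: a Latin 1 locale would be expected to map it, while the classic "C" locale may simply return the fallback.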

弱骨蛰伏 2024-08-31 08:36:52

Depends on the encoding used in your char array.

If your char array is Latin 1 encoded, then it is 2 bytes long (plus maybe a NUL terminator, we don't care), and those 2 bytes are:

  • 0xE4 (lower-case a umlaut)
  • 0x61 (lower-case a).

Note that Latin 1 is not ASCII, and 0xE4 is not an ASCII value; it's a Latin 1 (or Unicode) value.

You would get the value like this:

int i = (unsigned char) my_array[0];
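The cast matters because plain char is signed on many platforms, including gcc on x86. A small sketch of the same idea wrapped in a function (the name `first_byte_value` is hypothetical):

```cpp
// Returns the value of the first byte as a number in 0..255.
int first_byte_value(const char* s) {
    // Without the cast, a signed char holding 0xE4 would convert to -28
    // rather than 228 on platforms where char is signed.
    return static_cast<unsigned char>(s[0]);
}
```

When writing the test string by hand, note that "\xE4a" would be read as one long hex escape, so it has to be split as "\xE4" "a".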

If your char array is UTF-8 encoded, then it is three bytes long, and those bytes are:

  • binary 11000011 (first byte of UTF-8 encoded 0xE4)
  • binary 10100100 (second byte of UTF-8 encoded 0xE4)
  • 0x61 (lower-case a)
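To see how the two lead bytes above combine into 0xE4: the first byte contributes its low 5 bits (0xC3 & 0x1F = 0x03), the second its low 6 bits (0xA4 & 0x3F = 0x24), and (0x03 << 6) | 0x24 = 0xE4. A minimal decoder along those lines is sketched below. It is for illustration only: the function name is made up, and it does not reject overlong forms or surrogates as a production decoder should.

```cpp
#include <cstddef>

// Decodes the first UTF-8 sequence in s[0..len) into a Unicode code point.
// Returns the code point, or -1 for truncated or malformed input.
long decode_first_utf8(const char* s, std::size_t len) {
    if (len == 0) return -1;
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    std::size_t extra;  // number of continuation bytes expected
    long cp;            // payload bits collected so far
    if (b0 < 0x80)                { extra = 0; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return -1;     // stray continuation byte or invalid lead byte
    if (len < extra + 1) return -1;
    for (std::size_t i = 1; i <= extra; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return -1;  // must look like 10yyyyyy
        cp = (cp << 6) | (b & 0x3F);
    }
    return cp;
}
```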

To recover the Unicode value of a character encoded with UTF-8, you either need to implement the decoding yourself based on http://en.wikipedia.org/wiki/UTF-8#Description (usually a bad idea in production code), or use a platform-specific multibyte-to-wchar_t conversion routine. On Linux this is mbstowcs or iconv, although for a single character you can use mbtowc, provided that the multibyte encoding defined for the current locale is in fact UTF-8:

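// Needs <cstdlib> for mbtowc. A program starts in the "C" locale, so call
// setlocale(LC_ALL, "") from <clocale> first to pick up the environment's locale.
wchar_t i;
if (mbtowc(&i, my_array, 3) == -1) {
    // handle error: the bytes are not a valid character in the locale's encoding
}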

If the locale's encoding is something else, such as Shift-JIS, then this doesn't work...
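Putting the locale requirement and the mbtowc call together, a sketch might look like this. The function name is hypothetical, and the locale name passed in (for example "C.UTF-8" or "en_US.UTF-8") is an assumption about what is installed on the system; the function reports -1 when the locale is not available.

```cpp
#include <clocale>
#include <cstdlib>
#include <cstring>

// Decodes the first multibyte character of s using the named locale.
// Returns the wchar_t value, or -1 if the locale is unavailable or the
// bytes are not valid in that locale's encoding.
// Side effect: changes the program's global LC_CTYPE locale.
long first_char_via_locale(const char* s, const char* locale_name) {
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) return -1;
    std::mbtowc(nullptr, nullptr, 0);  // reset any conversion state
    wchar_t wc;
    const int n = std::mbtowc(&wc, s, std::strlen(s));
    return n > 0 ? static_cast<long>(wc) : -1;
}
```

On glibc, where wchar_t holds Unicode code points, a UTF-8 locale should give 228 for the bytes C3 A4 61; other platforms may differ.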

怼怹恏 2024-08-31 08:36:52

What you want is called transliteration: converting letters of one language to another. It has nothing to do with Unicode or wchars. You need a mapping table.
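A mapping table of that kind could be sketched as follows. The entries and the function name are purely illustrative, and the input is assumed to be already decoded into code points.

```cpp
#include <map>
#include <string>

// Replaces each code point found in the table with an ASCII spelling.
// ASCII input is copied as-is; unmapped non-ASCII becomes '?'.
std::string transliterate(const std::u32string& in) {
    static const std::map<char32_t, std::string> table = {
        {U'\u00E4', "ae"}, {U'\u00F6', "oe"},
        {U'\u00FC', "ue"}, {U'\u00DF', "ss"},
    };
    std::string out;
    for (char32_t cp : in) {
        const auto it = table.find(cp);
        if (it != table.end()) out += it->second;
        else if (cp < 0x80)    out += static_cast<char>(cp);
        else                   out += '?';  // no mapping known
    }
    return out;
}
```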
