How do I use CharNext correctly in the Windows API?

Posted on 2024-07-30 10:37:52

I have a multi-byte string containing a mixture of Japanese and Latin characters. I'm trying to copy parts of this string to a separate memory location. Since it's a multi-byte string, some of the characters use one byte and others use two. When copying parts of the string, I must not copy "half" a Japanese character. To do this properly, I need to be able to determine where characters start and end in the multi-byte string.

As an example, if the string contains 3 characters that require [2 bytes][2 bytes][1 byte], I must copy either 2, 4 or 5 bytes to the other location and not 3, since copying 3 bytes would copy only half of the second character.

To figure out where characters start and end in the multi-byte string, I'm trying to use the Windows API functions CharNext and CharNextExA, but without luck. When I use these functions, they step through my string one byte at a time rather than one character at a time. According to MSDN, "The CharNext function retrieves a pointer to the next character in a string."

Here's some code to illustrate this problem:

#include <windows.h>
#include <stdio.h>
#include <wchar.h>
#include <string.h>

/* string consisting of six "asian" characters */
wchar_t wcsString[] = L"\u9580\u961c\u9640\u963f\u963b\u9644";

int main() 
{
   // Convert the asian string from wide char to multi-byte.
   LPSTR mbString = new char[1000];
   WideCharToMultiByte( CP_UTF8, 0, wcsString, -1, mbString, 100,  NULL, NULL);

   // Count the number of characters in the string.
   int characterCount = 0;
   LPSTR currentCharacter = mbString;
   while (*currentCharacter)
   {
      characterCount++;

      currentCharacter = CharNextExA(CP_UTF8, currentCharacter, 0);
   }
}

(Please ignore the memory leak and the lack of error checking.)

Now, in the example above I would expect characterCount to become 6, since that's the number of characters in the Asian string. Instead, characterCount becomes 18, because mbString contains 18 characters:

é–€é˜œé™€é˜¿é˜»é™„

I don't understand how it's supposed to work. How is CharNext supposed to know whether "é–€é" in the string is an encoded version of a Japanese character, or in fact the characters é – € and é?

Some notes:

  • I've read Joel's blog post about what every developer needs to know about Unicode. I may have misunderstood something in it, though.
  • If all I wanted to do was count the characters, I could count the characters in the Asian string directly. Keep in mind that my real goal is copying parts of the multi-byte string to a separate location. The separate location only supports multi-byte, not wide char.
  • If I convert the content of mbString back to wide char using MultiByteToWideChar, I get the correct string (門阜陀阿阻附), which indicates that there's nothing wrong with mbString.

EDIT:
Apparently the CharNext functions don't support UTF-8, but Microsoft forgot to document that. I threw/copy-pasted together my own routine, which I won't use and which needs improving. I'm guessing it's easy to crash.

  LPSTR CharMoveNext(LPSTR szString)
  {
     if (szString == 0 || *szString == 0)
        return 0;

     if ( (szString[0] & 0x80) == 0x00)
        return szString + 1;
     else if ( (szString[0] & 0xE0) == 0xC0)
        return szString + 2;
     else if ( (szString[0] & 0xF0) == 0xE0)
        return szString + 3;
     else if ( (szString[0] & 0xF8) == 0xF0)
        return szString + 4;
     else
        return szString +1;
  }


Comments (6)

听,心雨的声音 2024-08-06 10:37:52

There is a really good explanation of what is going on here at the Sorting it All Out blog: "Is CharNextExA broken?". In short, CharNext is not designed to work with UTF-8 strings.

反差帅 2024-08-06 10:37:52

As far as I can determine (Google and experimentation), CharNextExA doesn't actually work with UTF-8. It only supports multibyte encodings that use lead/trail byte pairs or single-byte characters.

UTF-8 is a fairly regular encoding. There are a lot of libraries that will do what you want, but it's also fairly easy to roll your own.

Have a look at unicode.org, particularly Table 3-7, for the valid sequence forms.

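// Returns a pointer to the next character, `in` itself at the end of the
// string, or NULL if the bytes at `in` are not a well-formed sequence.
const char* NextUtf8( const char* in )
{
    if( in == NULL || *in == '\0' )
        return in;

    const unsigned char* p = reinterpret_cast<const unsigned char*>(in);
    unsigned char uc = p[0];
    int trail;                          // number of trail bytes expected
    unsigned char lo = 0x80, hi = 0xBF; // valid range for the first trail byte

    if( uc < 0x80 )
        return in + 1;                  // ASCII
    else if( uc < 0xc2 )
        return NULL;                    // invalid lead byte
    else if( uc < 0xe0 )
        trail = 1;
    else if( uc < 0xf0 )
    {
        trail = 2;
        if( uc == 0xe0 ) lo = 0xA0;     // excludes overlong forms
        if( uc == 0xed ) hi = 0x9F;     // excludes surrogates
    }
    else if( uc < 0xf5 )
    {
        trail = 3;
        if( uc == 0xf0 ) lo = 0x90;     // excludes overlong forms
        if( uc == 0xf4 ) hi = 0x8F;     // excludes values above U+10FFFF
    }
    else
        return NULL;                    // 0xF5 .. 0xFF never appear in UTF-8

    // Checks stop at the first bad byte, so a terminating NUL is never passed.
    if( p[1] < lo || p[1] > hi )
        return NULL;
    for( int i = 2; i <= trail; ++i )
        if( p[i] < 0x80 || p[i] > 0xBF )
            return NULL;

    return in + 1 + trail;
}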
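
As a rough illustration of the lead-byte approach, here is a minimal sketch that counts the characters in the question's string without any Windows calls. The helper names Utf8SequenceLength and CountUtf8CodePoints are made up for this example, and the check is deliberately simpler than the full Table 3-7 rules: it looks only at the lead byte and at the top two bits of each trail byte.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts with
// lead byte `b`, or 0 if `b` cannot start a sequence.
static std::size_t Utf8SequenceLength(unsigned char b)
{
    if (b < 0x80) return 1;   // ASCII
    if (b < 0xC2) return 0;   // trail byte or overlong lead byte
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF5) return 4;
    return 0;                 // 0xF5..0xFF never appear in UTF-8
}

// Hypothetical helper: counts the code points in a NUL-terminated UTF-8
// string. Returns -1 on an invalid lead byte or a truncated sequence.
static int CountUtf8CodePoints(const char* s)
{
    assert(s != nullptr);
    int count = 0;
    while (*s != '\0')
    {
        std::size_t len = Utf8SequenceLength(static_cast<unsigned char>(*s));
        if (len == 0)
            return -1;
        // Every trail byte must look like 10xxxxxx. A NUL fails this test,
        // so the loop never reads past the terminator.
        for (std::size_t i = 1; i < len; ++i)
            if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
                return -1;
        s += len;
        ++count;
    }
    return count;
}
```

Passing the 18 bytes that WideCharToMultiByte produces for the question's string (E9 96 80, E9 98 9C, and so on) should give 6, which is the count the question expected from CharNextExA.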
枫林﹌晚霞¤ 2024-08-06 10:37:52

Given that CharNextExA doesn't work with UTF-8, you can parse it yourself. Just skip over the characters that have 10 in the top two bits. You can see the pattern in the definition of UTF-8: http://en.wikipedia.org/wiki/Utf-8

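LPSTR CharMoveNext(LPSTR szString)
{
    // Do not step past the terminating NUL.
    if (*szString == '\0')
        return szString;

    // Step off the current byte, then skip any continuation bytes (10xxxxxx).
    ++szString;
    while ((*szString & 0xc0) == 0x80)
        ++szString;
    return szString;
}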
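
The same top-two-bits test can also serve the question's actual goal, copying part of the string without cutting a character in half. The sketch below is only one way to do it: Utf8PrefixLength and CopyUtf8Prefix are hypothetical names, and the code assumes the input is already well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: true if `b` is a UTF-8 continuation byte (10xxxxxx).
static bool IsUtf8Continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Returns the largest byte count <= maxBytes that ends on a character
// boundary of the NUL-terminated string `s`. Assumes well-formed UTF-8.
static std::size_t Utf8PrefixLength(const char* s, std::size_t maxBytes)
{
    assert(s != nullptr);
    std::size_t len = std::strlen(s);
    if (maxBytes >= len)
        return len;                  // the whole string fits
    std::size_t n = maxBytes;
    // s[n] is the first byte that would be left out. If it is a continuation
    // byte, the cut falls inside a character, so back up to its lead byte.
    while (n > 0 && IsUtf8Continuation(static_cast<unsigned char>(s[n])))
        --n;
    return n;
}

// Copies as many whole characters as fit into dst (dstSize includes the NUL)
// and returns the number of bytes copied, not counting the terminator.
static std::size_t CopyUtf8Prefix(char* dst, std::size_t dstSize, const char* src)
{
    assert(dst != nullptr && dstSize > 0);
    std::size_t n = Utf8PrefixLength(src, dstSize - 1);
    std::memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

For a string laid out as [2 bytes][2 bytes][1 byte], as in the question, a limit of 3 bytes is rounded down to 2, while limits of 2, 4 and 5 are returned unchanged.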
(り薆情海 2024-08-06 10:37:52

This isn't a direct answer to your question, but you may find the following tutorial helpful; I certainly did. In fact, the information provided there is enough that you should be able to traverse the multi-byte string yourself with ease:

Complete String Tutorial

琴流音 2024-08-06 10:37:52

Try using 932 for the code page. I don't think CP_UTF8 is a real code page, and it may only work for WideCharToMultiByte() and back. You can also try isleadbyte(), but that requires either setting the locale correctly or setting the default code page correctly. I have successfully used IsDBCSLeadByteEx(), but never with CP_UTF8.
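
If the data can be kept in code page 932 instead of UTF-8, stepping by lead byte might look roughly like the sketch below. To keep it portable, the lead-byte ranges of code page 932 (0x81..0x9F and 0xE0..0xFC) are hard-coded here as an assumption rather than queried through IsDBCSLeadByteEx(932, b), and the names IsCp932LeadByte, NextCp932 and CountCp932Chars are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: lead-byte ranges of code page 932, hard-coded instead of
// calling IsDBCSLeadByteEx(932, b).
static bool IsCp932LeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical helper: advances one character in a NUL-terminated
// code page 932 string. Returns `s` unchanged at the end of the string.
static const char* NextCp932(const char* s)
{
    assert(s != nullptr);
    if (*s == '\0')
        return s;
    // A lead byte followed by a non-NUL byte forms one two-byte character.
    if (IsCp932LeadByte(static_cast<unsigned char>(*s)) && s[1] != '\0')
        return s + 2;
    return s + 1;                    // single byte, or a dangling lead byte
}

// Hypothetical helper: counts characters by stepping with NextCp932.
static std::size_t CountCp932Chars(const char* s)
{
    assert(s != nullptr);
    std::size_t count = 0;
    for (; *s != '\0'; s = NextCp932(s))
        ++count;
    return count;
}
```

Unlike UTF-8, a trail byte in this encoding can have the same value as a lead byte or an ASCII character, so a string like this can only be walked forward from a known character boundary.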

伪装你 2024-08-06 10:37:52
static const char *CharNextUTF8(const char *psz)
{
    // get the first char, and then move the
    // pointer to the next byte by default.
    BYTE c = (BYTE)*psz++;

    // if the highest bit of the char is set ...
    if (c & 0x80)
    {
        BYTE x = 0;

        // count the continuous bits set after the highest bit,
        // that means to calculate the count of following bytes.
        while (c & 0x40)
        {
            c <<= 1;
            x++;
        }

        // ok, there should be 'x' bytes following the first byte.
        for (BYTE i = 0; i < x; i++)
        {
            // if any byte is not a valid following byte...
            if ((psz[i] & 0xC0) != 0x80)
            {
                goto done;
            }
        }

        // all the following bytes are valid,
        // move the pointer to skip all.
        psz += x;
    }

done:
    return psz;
}