Getting the actual length of a UTF-8 encoded std::string?

Posted on 2024-09-30 15:19:15 · 500 words · 8 views · 0 comments

My std::string is UTF-8 encoded so obviously, str.length() returns the wrong result.

I found this information but I'm not sure how I can use it to do this:

The following byte sequences are used to represent a character. The sequence to be used depends on the UCS code number of the character:

   0x00000000 - 0x0000007F:
       0xxxxxxx

   0x00000080 - 0x000007FF:
       110xxxxx 10xxxxxx

   0x00000800 - 0x0000FFFF:
       1110xxxx 10xxxxxx 10xxxxxx

   0x00010000 - 0x001FFFFF:
       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

How can I find the actual length of a UTF-8 encoded std::string? Thanks

Comments (12)

百合的盛世恋 2024-10-07 15:19:15

Count all first-bytes (the ones that don't match 10xxxxxx).

int len = 0;
while (*s) len += (*s++ & 0xc0) != 0x80;
小矜持 2024-10-07 15:19:15

"C++ knows nothing about encodings, so you can't expect to use a standard function to do this."

The standard library indeed does acknowledge the existence of character encodings, in the form of locales. If your system supports a locale, it is very easy to use the standard library to compute the length of a string. In the example code below I assume your system supports the locale en_US.utf8. If I compile the code and execute it as "./a.out ソニーSony", the output is that there were 13 char-values and 7 characters. And all without any reference to the internal representation of UTF-8 character codes or having to use 3rd party libraries.

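#include <clocale>
#include <cstdlib>
#include <iostream>
#include <string>

using namespace std;

int main(int argc, char *argv[])
{
  if (argc < 2) return 1;  // a string argument is required
  string str(argv[1]);
  size_t strLen = str.length();
  cout << "Length (char-values): " << strLen << '\n';
  // setlocale returns NULL if the locale is not available on this system
  if (!setlocale(LC_ALL, "en_US.utf8")) return 1;
  size_t u = 0;
  const char *c_str = str.c_str();
  size_t charCount = 0;
  while (u < strLen)
  {
    // mblen returns the byte length of the next multibyte character,
    // or -1 if the bytes do not form a valid character
    int n = mblen(&c_str[u], strLen - u);
    if (n <= 0) return 1;  // stop instead of adding -1 to an unsigned index
    u += n;
    charCount += 1;
  }
  cout << "Length (characters): " << charCount << endl;
}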
吃→可爱长大的 2024-10-07 15:19:15

This is a naive implementation, but it should be helpful for you to see how this is done:

#include <cstddef>
#include <stdexcept>
#include <string>

std::size_t utf8_length(std::string const &s) {
  std::size_t len = 0;
  std::string::const_iterator begin = s.begin(), end = s.end();
  while (begin != end) {
    unsigned char c = *begin;
    int n;  // number of bytes in the sequence that starts at c
    if      ((c & 0x80) == 0)    n = 1;
    else if ((c & 0xE0) == 0xC0) n = 2;
    else if ((c & 0xF0) == 0xE0) n = 3;
    else if ((c & 0xF8) == 0xF0) n = 4;
    else throw std::runtime_error("utf8_length: invalid UTF-8");

    if (end - begin < n) {
      throw std::runtime_error("utf8_length: string too short");
    }
    for (int i = 1; i < n; ++i) {
      if ((begin[i] & 0xC0) != 0x80) {
        throw std::runtime_error("utf8_length: expected continuation byte");
      }
    }
    ++len;       // one character, however many bytes it occupies
    begin += n;  // skip the whole sequence
  }
  return len;
}
请恋爱 2024-10-07 15:19:15

You should probably take the advice of Omry and look into a specialized library for this. That said, if you just want to understand the algorithm to do this, I'll post it below.

Basically, you can convert your string into a wider-element format, such as wchar_t. Note that wchar_t has a few portability issues, because its size varies with your platform. On Windows, wchar_t is 2 bytes, and therefore ideal for representing UTF-16. But on UNIX/Linux, it's four bytes and is therefore used to represent UTF-32. Therefore, on Windows this will only work if you don't include any Unicode code points above 0xFFFF. On Linux you can include the entire range of code points in a wchar_t. (Fortunately, this issue will be mitigated with the C++0x Unicode character types.)

With that caveat noted, you can create a conversion function using the following algorithm:

#include <iostream>
#include <iterator>
#include <string>

template <class OutputIterator>
inline OutputIterator convert(const unsigned char* it, const unsigned char* end, OutputIterator out)
{
    while (it != end)
    {
        if (*it < 128) *out++ = *it++; // single byte (ASCII) character
        else if (*it < 192) ++it; // stray continuation byte: cannot start a character, skip it
        else if (*it < 224 && it + 1 < end && *(it+1) > 127) {
            // double byte character
            *out++ = ((*it & 0x1F) << 6) | (*(it+1) & 0x3F);
            it += 2;
        }
        else if (*it < 240 && it + 2 < end && *(it+1) > 127 && *(it+2) > 127) {
            // triple byte character
            *out++ = ((*it & 0x0F) << 12) | ((*(it+1) & 0x3F) << 6) | (*(it+2) & 0x3F);
            it += 3;
        }
        else if (*it < 248 && it + 3 < end && *(it+1) > 127 && *(it+2) > 127 && *(it+3) > 127) {
            // 4-byte character
            *out++ = ((*it & 0x07) << 18) | ((*(it+1) & 0x3F) << 12) |
                ((*(it+2) & 0x3F) << 6) | (*(it+3) & 0x3F);
            it += 4;
        }
        else ++it; // Invalid byte sequence (throw an exception here if you want)
    }

    return out;
}

int main()
{
    std::string s = "\xC3\xAAtre"; // "être" written out as UTF-8 bytes (U+00EA is C3 AA)
    std::cout << s.length() << std::endl; // length in bytes

    std::wstring output;
    convert(reinterpret_cast<const unsigned char*>(s.c_str()),
        reinterpret_cast<const unsigned char*>(s.c_str()) + s.length(), std::back_inserter(output));

    std::cout << output.length() << std::endl; // Actual length
}

The algorithm isn't fully generic, because the InputIterator needs to be an unsigned char, so you can interpret each byte as having a value between 0 and 0xFF. The OutputIterator is generic, (just so you can use an std::back_inserter and not worry about memory allocation), but its use as a generic parameter is limited: basically, it has to output to an array of elements large enough to represent a UTF-16 or UTF-32 character, such as wchar_t, uint32_t or the C++0x char32_t types. Also, I didn't include code to convert character byte sequences greater than 4 bytes, but you should get the point of how the algorithm works from what's posted.

Also, if you just want to count the number of characters, rather than output to a new wide-character buffer, you can modify the algorithm to include a counter rather than an OutputIterator. Or better yet, just use Marcelo Cantos' answer to count the first-bytes.
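
One way to get a count without touching the conversion routine is to hand it an output iterator that throws the values away and only counts assignments. This is a variation on the counter idea rather than something from the answer itself, and `CountingIterator` is a made-up name; treat it as a minimal sketch.

```cpp
#include <cstddef>
#include <iterator>

// Hypothetical output iterator: every assignment through it bumps a counter
// owned by the caller, and the assigned value is discarded.
struct CountingIterator {
    typedef std::output_iterator_tag iterator_category;
    typedef void value_type;
    typedef std::ptrdiff_t difference_type;
    typedef void pointer;
    typedef void reference;

    std::size_t* count;  // points at the caller's counter

    // Dereference and increment are no-ops that keep `*out++ = x` compiling.
    CountingIterator& operator*() { return *this; }
    CountingIterator& operator++() { return *this; }
    CountingIterator operator++(int) { return *this; }

    // Assigning any value counts as one decoded character.
    template <class T>
    CountingIterator& operator=(const T&) {
        ++*count;
        return *this;
    }
};
```

With the `convert` function above, the call would then look something like `std::size_t n = 0; convert(first, last, CountingIterator{&n});`, after which `n` holds the number of decoded characters and no wide buffer is allocated.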

稍尽春風 2024-10-07 15:19:15

I recommend you use UTF8-CPP. It's a header-only library for working with UTF-8 in C++. With this lib, it would look something like this:

int LengthOfUtf8String( const std::string &utf8_string )
{
    return utf8::distance( utf8_string.begin(), utf8_string.end() );
}

(Code is off the top of my head.)

秋意浓 2024-10-07 15:19:15

Most of my personal C library code has only really been tested with English text, but here is how I've implemented my UTF-8 string length function. I originally based it on the bit patterns described in this wiki page table. This isn't the most readable code, but my intent was to remove any branching from the loop. Apologies for posting C code when the question asks for C++; it should translate to std::string in C++ fairly easily with some slight modifications. The functions below are copied from my website, if you're interested.

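#include <stddef.h>

/* str must not be a const pointer: the loop advances it */
size_t utf8len(const char* str) {
    size_t len = 0;
    for (; *str != 0; ++len) {
        /* v0..v3 are the top four bits of the lead byte */
        int v0 = (*str & 0x80) >> 7;
        int v1 = (*str & 0x40) >> 6;
        int v2 = (*str & 0x20) >> 5;
        int v3 = (*str & 0x10) >> 4;
        /* advance by 1, 2, 3 or 4 bytes depending on the lead byte pattern */
        str += 1 + v0 * v1 + v0 * v1 * v2 + v0 * v1 * v2 * v3;
    }
    return len;
}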

Note that this does not validate any of the bytes (much like all the other suggested answers here). Personally I would separate string validation from my string length function, as that is not its responsibility. If we were to move string validation to another function, the validation could be done something like the following.

bool utf8valid(const char* const str) {
    if (str == NULL)
        return false;
    const char* c = str;
    bool valid = true;
    for (size_t i = 0; c[0] != 0 && valid;) {
        valid = (c[0] & 0x80) == 0
            || ((c[0] & 0xE0) == 0xC0 && (c[1] & 0xC0) == 0x80)
            || ((c[0] & 0xF0) == 0xE0 && (c[1] & 0xC0) == 0x80 && (c[2] & 0xC0) == 0x80)
            || ((c[0] & 0xF8) == 0xF0 && (c[1] & 0xC0) == 0x80 && (c[2] & 0xC0) == 0x80 && (c[3] & 0xC0) == 0x80);
        int v0 = (c[0] & 0x80) >> 7;
        int v1 = (c[0] & 0x40) >> 6;
        int v2 = (c[0] & 0x20) >> 5;
        int v3 = (c[0] & 0x10) >> 4;
        i += 1 + v0 * v1 + v0 * v1 * v2 + v0 * v1 * v2 * v3;
        c = str + i;
    }
    return valid;
}

If you are going for readability, I'll admit that the other suggestions are quite a bit more readable.

Update: Thanks to Max Brauer (who left a comment) for simplifying the code a little bit. Here is what the utf8len would become with his simplification.

size_t utf8len(const char* str) {
    size_t len = 0;
    for (size_t i = 0; *str != 0; ++len) {
        int v01 = ((*str & 0x80) >> 7) & ((*str & 0x40) >> 6);
        int v2 = (*str & 0x20) >> 5;
        int v3 = (*str & 0x10) >> 4;
        str += 1 + ((v01 << v2) | (v01 & v3));
    }
    return len;
}
请持续率性 2024-10-07 15:19:15

Try to use an encoding library like iconv.
It probably has the API you want.

An alternative is to implement your own utf8strlen, which determines the length of each code point and iterates over code points instead of chars.
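
A minimal sketch of that hand-rolled alternative, assuming the input is already valid UTF-8: read each lead byte, work out how many bytes its code point occupies, and jump ahead by that amount. The names `utf8_sequence_length` and `utf8strlen` are illustrative only.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the sequence started by `lead`. Bytes that cannot start
// a sequence fall back to 1 so the loop below always makes progress.
std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;
}

// Counts code points by stepping from lead byte to lead byte.
std::size_t utf8strlen(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++count) {
        i += utf8_sequence_length(static_cast<unsigned char>(s[i]));
    }
    return count;
}
```

Because the loop is bounded by `s.size()` rather than a terminating NUL, a truncated final sequence simply ends the count instead of reading past the buffer.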

反话 2024-10-07 15:19:15

A slightly lazy approach would be to only count lead bytes, but visit every byte. This saves the complexity of decoding the various lead byte sizes, but obviously you pay to visit all the bytes, though there usually aren't that many (2x-3x):

size_t utf8Len(std::string s)
{
  return std::count_if(s.begin(), s.end(),
    [](char c) { return (static_cast<unsigned char>(c) & 0xC0) != 0x80; } );
}

Note that certain byte values are illegal as lead bytes (for example, those that would encode values larger than the 21 bits needed for the full Unicode range), but the other approaches would not know how to deal with such bytes anyway.
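
To make the remark about illegal lead bytes concrete, here is a small sketch of a check that could sit next to the counting function. The helper name `is_legal_lead_byte` is made up for illustration; the byte ranges follow the UTF-8 definition in RFC 3629, which stops at U+10FFFF.

```cpp
// Returns true if `b` may appear as the first byte of a UTF-8 sequence.
bool is_legal_lead_byte(unsigned char b) {
    if (b < 0x80) return true;   // 0xxxxxxx: ASCII
    if (b < 0xC2) return false;  // 0x80-0xBF are continuation bytes;
                                 // 0xC0 and 0xC1 only occur in overlong forms
    return b <= 0xF4;            // 0xF5-0xFF would encode values above U+10FFFF
}
```

This only looks at the first byte; a full validator would also have to check the continuation bytes that follow.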

风尘浪孓 2024-10-07 15:19:15

UTF-8 CPP library has a function that does just that. You can either include the library into your project (it is small) or just look at the function. http://utfcpp.sourceforge.net/

const char* twochars = "\xe6\x97\xa5\xd1\x88"; // string literals are const in C++
size_t dist = utf8::distance(twochars, twochars + 5);
assert (dist == 2);
为你拒绝所有暧昧 2024-10-07 15:19:15

I ported this code from php-iconv to C++. You need to have iconv available first; I hope it is useful:

// porting from PHP
// http://lxr.php.net/xref/PHP_5_4/ext/iconv/iconv.c#_php_iconv_strlen
#define GENERIC_SUPERSET_NBYTES 4
#define GENERIC_SUPERSET_NAME   "UCS-4LE"

UInt32 iconvStrlen(const char *str, size_t nbytes, const char* encode)
{
    UInt32 retVal = (unsigned int)-1;

    unsigned int cnt = 0;

    iconv_t cd = iconv_open(GENERIC_SUPERSET_NAME, encode);
    if (cd == (iconv_t)(-1))
        return retVal;

    const char* in;
    size_t  inLeft;

    char *out;
    size_t outLeft = 0; // initialized: it is read after the loop even when nbytes is 0

    char buf[GENERIC_SUPERSET_NBYTES * 2] = {0};

    for (in = str, inLeft = nbytes, cnt = 0; inLeft > 0; cnt += 2) 
    {
        size_t prev_in_left;
        out = buf;
        outLeft = sizeof(buf);

        prev_in_left = inLeft;

        if (iconv(cd, &in, &inLeft, (char **) &out, &outLeft) == (size_t)-1) {
            if (prev_in_left == inLeft) {
                break;
            }
        }
    }
    iconv_close(cd);

    if (outLeft > 0)
        cnt -= outLeft / GENERIC_SUPERSET_NBYTES;

    retVal = cnt;
    return retVal;
}

UInt32 utf8StrLen(const std::string& src)
{
    return iconvStrlen(src.c_str(), src.length(), "UTF-8");
}
爱你是孤单的心事 2024-10-07 15:19:15

Just another naive implementation to count chars in UTF-8 string

int utf8_strlen(const string& str)
{
    int c,i,ix,q;
    for (q=0, i=0, ix=str.length(); i < ix; i++, q++)
    {
        c = (unsigned char) str[i];
        if      (c>=0   && c<=127) i+=0;
        else if ((c & 0xE0) == 0xC0) i+=1;
        else if ((c & 0xF0) == 0xE0) i+=2;
        else if ((c & 0xF8) == 0xF0) i+=3;
        //else if (($c & 0xFC) == 0xF8) i+=4; // 111110bb //byte 5, unnecessary in 4 byte UTF-8
        //else if (($c & 0xFE) == 0xFC) i+=5; // 1111110b //byte 6, unnecessary in 4 byte UTF-8
        else return 0;//invalid utf8
    }
    return q;
}
静谧 2024-10-07 15:19:15

Most (if not all) of the other answers only give the number of code points and completely fail for combining characters, emojis or more complex scripts. For example here's an example output from user2781185's solution above after modifying slightly for a demo on Godbolt:

Length (char-values):  5, length (code points):  4. String: café
Length (char-values):  6, length (code points):  5. String: café
Length (char-values): 15, length (code points):  5. String: 가각
Length (char-values): 24, length (code points):  8. String: ဂ︀င︀⋚︀丸︀
Length (char-values): 47, length (code points): 13. String: ????️‍????????‍????‍????‍????????????
Length (char-values): 74, length (code points): 21. String: ????‍????‍????‍????????‍????️????????‍❤️‍????‍????????
Length (char-values): 21, length (code points):  7. String: ফোল্ডার
Length (char-values): 18, length (code points):  8. String: dര്‍g1️⃣
Length (char-values): 18, length (code points):  6. String: Xല്‍????????

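As you can see, the lengths returned are just the number of code points and bear no relation whatsoever to what users see ("user-perceived characters"). Even the two café strings differ: one uses a precomposed é, the other a plain e followed by a combining accent, so the same visible word yields 4 code points in one case and 5 in the other.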
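
The two café lines can be reproduced with a short sketch. It assumes the first string uses the precomposed é (U+00E9) and the second uses e followed by the combining acute accent (U+0301), which is consistent with the byte and code point counts shown above. `count_code_points` is a hypothetical helper that counts non-continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Counts code points by counting every byte that is not a continuation
// byte (10xxxxxx). Assumes valid UTF-8.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// "café" with the precomposed U+00E9, encoded as the bytes C3 A9.
const std::string precomposed = "caf\xC3\xA9";
// "café" as 'e' followed by combining acute U+0301, encoded as CC 81.
const std::string decomposed = "cafe\xCC\x81";
```

Here `precomposed` is 5 bytes and 4 code points, while `decomposed` is 6 bytes and 5 code points, even though both are displayed as the same four user-perceived characters.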

To get the actual number of visible characters (grapheme clusters, loosely called glyphs here) you have to use a proper library like Boost.Unicode/Boost.Text/Boost.Locale or the official ICU from the Unicode Consortium: normalize the string to a composed form like NFC or NFKC first, then count the length in grapheme clusters.

This is the sample code on how to do that:

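#include <unicode/unistr.h>
#include <unicode/normalizer2.h>
#include <unicode/brkiter.h>

#include <iostream>
#include <cassert>
#include <memory>

using namespace icu;

int main()
{
    UErrorCode errorCode = U_ZERO_ERROR;
    // The instance is owned by ICU and must not be deleted
    const Normalizer2* nfkc = Normalizer2::getNFKCInstance(errorCode);
    assert(U_SUCCESS(errorCode));

    // ALWAYS NORMALIZE THE STRINGS FIRST (the source file is assumed to be UTF-8)
    const UnicodeString str = nfkc->normalize(
        UnicodeString::fromUTF8("नमस्ते café café ????‍????️????????‍♀️"), errorCode);
    assert(U_SUCCESS(errorCode));

    {
        UErrorCode err = U_ZERO_ERROR;
        std::unique_ptr<BreakIterator> iter(
            BreakIterator::createCharacterInstance(Locale::getDefault(), err));
        assert(U_SUCCESS(err));
        iter->setText(str);

        // Each call to next() moves to the following grapheme cluster boundary
        int count = 0;
        while(iter->next() != BreakIterator::DONE) ++count;
        std::cout << count << std::endl;
    }

    return 0;
}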

Another, probably simpler, library for that purpose is yhirose/cpp-unicodelib:

std::u32string s = U"hello☺????";
auto normalized = unicode::to_nfkc(s.c_str(), s.length());
std::cout << "Length: "
          << unicode::grapheme_count(normalized.c_str(), normalized.length())
          << std::endl;
