wstring.size() works differently in Xcode and Visual C++

Posted on 2025-01-04 19:13:37

I ran the same code on both platforms to count the characters in a wide-character string. The test string contains ASCII letters, digits, and Korean characters.

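#include <iostream>
#include <string>

using namespace std;

// Prints every element of the string together with its index.
// T is the string type; trait is the type each element is copied into.
template <class T,class trait>
void DumpCharacters(T& a)
{
    size_t length = a.size();   // number of elements in the string
    for(size_t i=0;i<length;i++)
    {
        trait n = a[i];
        // Through cout, a wchar_t element is printed as its numeric value.
        cout<<i<<" => "<<n<<endl;
    }

    cout<<endl;
}

int main(int argc, char* argv[])
{
    // Wide string: one wchar_t per element.
    wstring u = L"123abc가1나1다";
    wcout<<u<<endl;
    DumpCharacters<wstring,wchar_t>(u);

    // Narrow string: one char (byte) per element.
    string s = "123abc가1나1다";
    cout<<s<<endl;
    DumpCharacters<string,char>(s);

    return 0;
}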

In Visual C++ 2010, wstring.size() returns the number of characters (11), regardless of whether they are ASCII or international characters. In Xcode 4.2 on Mac OS X, however, it returns what looks like the byte count of the string data (17).

How can I get the character length of a wide-character string, rather than the byte count, in Xcode?

--- added on 12 Feb --

I found that wcslen() also returns 17 in Xcode, while it returns 11 in VC++.
Here's the tested code:

#include <cwchar>   // wcslen

const wchar_t *p = L"123abc가1나1다";
size_t plen = wcslen(p);

--- added on 18 Feb --

I found that LLVM 3.0 causes the wrong length. The problem went away after changing the compiler front end from LLVM 3.0 to 4.2.

The question "wcslen() works differently in Xcode and VC++" has the details.

你げ笑在眉眼 2025-01-11 19:13:37

It is an error if the std::wstring version reports 17 characters: it should report only 11. With recent SVN heads of gcc and clang, I get 11 characters for the std::wstring and 17 characters for the std::string. I think this is what is expected.

Please note that the standard C++ library internally has a different idea of what a "character" is from what you might expect when multi-word encodings are used (e.g. UTF-8 with words of type char, or UTF-16 with 16-bit words). Here is the first paragraph of the chapter describing strings (21.1 [strings.general]):

This Clause describes components for manipulating sequences of any non-array POD (3.9) type. In this Clause such types are called char-like types, and objects of char-like types are called char-like objects or simply characters.

This basically means that when using Unicode the various functions won't pay attention to what constitutes a code point but rather process the strings as a sequence of words. This has severe consequences, for example when producing substrings, because these may easily split multi-byte characters apart. Currently, the standard C++ library doesn't have any support for processing multi-byte encodings internally, because it is assumed that the translation from an encoding to characters is done when reading data (and correspondingly the other way when writing data). If you are processing multi-byte encoded strings internally, you need to be aware of this, as there is no support at all.
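To make the code-unit point concrete, here is a minimal sketch, assuming the narrow string holds valid UTF-8. The helper names count_utf8_code_points and sample_utf8 are made up for this illustration and are not part of the standard library. The sample is the question's test string written with explicit byte escapes, so it does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string that is assumed
// to contain valid UTF-8. It performs no validation.
std::size_t count_utf8_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        // Continuation bytes look like 10xxxxxx; any other byte
        // starts a new code point.
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// The question's test string as explicit UTF-8 bytes. The literal is
// split into pieces so that a digit following a \x escape is not
// swallowed into the escape sequence.
std::string sample_utf8()
{
    return "123abc\xEA\xB0\x80" "1" "\xEB\x82\x98" "1" "\xEB\x8B\xA4";
}
```

With these definitions, sample_utf8().size() is 17 (bytes), while count_utf8_code_points(sample_utf8()) is 11. Note that this still counts code points, not user-perceived characters.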

It is recognized that this state of affairs is actually a problem. For C++11 the character type char32_t was added, which should support Unicode characters better than wchar_t (because Unicode code points need up to 21 bits, while wchar_t was allowed to support only 16 bits, a choice made on some platforms at a time when Unicode was promising to use at most 16 bits). However, this still does not deal with combining characters. The C++ committee recognizes that this is a problem and that proper character processing in the standard C++ library would be nice to have, but so far nobody has come forward with a comprehensive proposal to address it (if you feel you want to propose something like this but don't know how, please feel free to contact me and I will help you with how to submit a proposal).
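As a sketch of what char32_t provides, assuming a C++11 compiler: each element of a std::u32string is one code point, so size() matches the code point count, but a combining sequence still occupies more than one element. The function names below are hypothetical and exist only for this illustration.

```cpp
#include <string>

// The question's test string as UTF-32, with the Korean syllables
// written as universal character names (U+AC00, U+B098, U+B2E4).
std::u32string sample_utf32()
{
    return U"123abc\uAC00" U"1\uB098" U"1\uB2E4";
}

// "e" followed by U+0301 (combining acute accent): two code points
// that are displayed as one accented letter.
std::u32string decomposed_e_acute()
{
    return U"e\u0301";
}
```

Here sample_utf32().size() is 11, whereas decomposed_e_acute().size() is 2 even though the sequence is displayed as a single character.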

忘东忘西忘不掉你 2025-01-11 19:13:37

Xcode 4.2 apparently used UTF-8 (or something very similar) as the narrow multibyte encoding for the string literal "123abc가1나1다" in the program's source code when initializing the string s. The UTF-8 representation of that string happens to be 17 bytes long.

The wide character representation (stored in u) is 11 wide characters. There are many ways to convert from narrow to wide encoding. Try this:

#include <iostream>
#include <string>
#include <clocale>
#include <cstdlib>

int main()
{
    std::wstring u = L"123abc가1나1다";
    std::cout << "Wide string contains " << u.size() << " characters\n";

    std::string s = "123abc가1나1다";
    std::cout << "Narrow string contains " << s.size() << " bytes\n";

    // Switch to the user's locale so mbstowcs knows the narrow encoding.
    std::setlocale(LC_ALL, "");
    // A null destination asks mbstowcs only for the wide-character count.
    std::cout << "Which can be converted to "
              << std::mbstowcs(NULL, s.c_str(), s.size())
              << " wide characters in the current locale.\n";
}
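
One caveat with the snippet above: mbstowcs returns (size_t)-1 when the bytes are not valid in the current locale's encoding, and printing that value directly shows a huge number. A small wrapper can make the failure explicit. This is only a sketch; count_wide_chars is a hypothetical name, and it assumes the null-destination counting mode of mbstowcs, which POSIX and Visual C++ provide.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: stores in 'count' the number of wide characters
// that 'narrow' converts to in the current C locale. Returns false if
// the bytes are not a valid multibyte sequence in that locale.
bool count_wide_chars(const std::string& narrow, std::size_t& count)
{
    // With a null destination the length argument is ignored and only
    // the required number of wide characters is returned.
    const std::size_t n = std::mbstowcs(NULL, narrow.c_str(), 0);
    if (n == static_cast<std::size_t>(-1))
        return false;   // invalid sequence for the current locale
    count = n;
    return true;
}
```

As in the original snippet, call std::setlocale(LC_ALL, "") first if the string contains non-ASCII characters; in the default "C" locale the conversion of such bytes may fail.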
