What is the best multi-platform way to handle Unicode strings in C++?

Posted on 2024-08-17 13:05:39

I know that there are already several questions on Stack Overflow about std::string versus std::wstring and similar topics, but none of them proposes a full solution.

In order to obtain a good answer I should define the requirements:

  • multi-platform usage: it must work on Windows, OS X and Linux
  • minimal effort for conversion to/from platform-specific Unicode strings like CFStringRef, wchar_t *, char * as UTF-8, or other types as required by the OS API. Remark: I don't need code-page conversion support because I expect to use only Unicode-compatible functions on all supported operating systems.
  • if an external library is required, it should be open source and under a very liberal license like BSD, but not LGPL.
  • be able to use printf format syntax or something similar.
  • an easy way to allocate/deallocate strings
  • performance is not very important because I assume that the Unicode strings are used only for the application UI.
  • some examples would be appreciated

I would really appreciate only one proposed solution per answer; that way people can vote for their preferred alternative. If you have more than one alternative, just add another answer.

Please indicate something that has actually worked for you.

Comments (6)

无边思念无边月 2024-08-24 13:05:39

I would strongly recommend using UTF-8 internally in your application, using regular old char* or std::string for data storage. For interfacing with APIs that use a different encoding (ASCII, UTF-16, etc.), I'd recommend using libiconv, which is licensed under the LGPL.

Example usage:

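#include <cassert>
#include <cstring>
#include <iconv.h>

// Converts a UTF-8 string into a temporary, NUL-terminated UTF-16LE buffer.
// Treating that buffer as wchar_t is only valid where wchar_t is 16 bits (e.g. Windows).
class TempWstring
{
public:
  explicit TempWstring(const char *str)
  {
    assert(sUTF8toUTF16 != (iconv_t)-1);
    size_t inBytesLeft = strlen(str) + 1;   // + 1 so the terminating NUL is converted too
    size_t outBytesLeft = 2 * inBytesLeft;  // worst case: 2 output bytes per input byte
    mStr = new char[outBytesLeft];
    char *inBuf = const_cast<char *>(str);  // POSIX iconv() takes a non-const char **
    char *outBuf = mStr;
    size_t result = iconv(sUTF8toUTF16, &inBuf, &inBytesLeft, &outBuf, &outBytesLeft);
    assert(result != (size_t)-1 && inBytesLeft == 0);  // iconv() returns (size_t)-1 on error
    (void)result;  // avoid an unused-variable warning when NDEBUG is defined
  }

  ~TempWstring()
  {
    delete [] mStr;
  }

  const wchar_t *Str() const { return reinterpret_cast<const wchar_t *>(mStr); }

  static void Init()
  {
    sUTF8toUTF16 = iconv_open("UTF-16LE", "UTF-8");
    assert(sUTF8toUTF16 != (iconv_t)-1);
  }

  static void Shutdown()
  {
    int err = iconv_close(sUTF8toUTF16);
    assert(err == 0);
    (void)err;
  }

private:
  // Not copyable: the object owns mStr.
  TempWstring(const TempWstring &) = delete;
  TempWstring &operator=(const TempWstring &) = delete;

  char *mStr;

  static iconv_t sUTF8toUTF16;  // shared conversion descriptor; not thread-safe
};

iconv_t TempWstring::sUTF8toUTF16 = (iconv_t)-1;

// At program startup:
TempWstring::Init();

// At program termination:
TempWstring::Shutdown();

// Now, to convert a UTF-8 string to a UTF-16 string, just do this:
TempWstring title("Entr\xc3\xa9""e");  // "Entrée"
const wchar_t *ws = title.Str();  // valid until title goes out of scope

// A less contrived example (the temporary lives until the call returns):
HWND hwnd = CreateWindowW(L"class name",
                          TempWstring("UTF-8 window title").Str(),
                          dwStyle, x, y, width, height, parent, menu, hInstance, lpParam);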
标点 2024-08-24 13:05:39

Same as Adam Rosenfield's answer (+1), but I use UTFCPP instead.

你げ笑在眉眼 2024-08-24 13:05:39

I was recently on a project that decided to use std::wstring for a cross-platform project because "wide strings are Unicode, right?" This led to a number of headaches:

  • How big is each element (wchar_t) of a wstring? Answer: It's up to the compiler implementation. In Visual Studio (Windows), it is 16 bits. But in Xcode (Mac), it is 32 bits.
  • This led to an unfortunate decision to use UTF-16 for communication over the wire. But which UTF-16? There are two: UTF-16BE (big-endian) and UTF-16LE (little-endian). Not being clear on this led to even more bugs.
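
Both pitfalls can be made concrete with a small C++11 sketch. This is only an illustration, not code from the project described above; the helper names to_utf16_bytes and wide_char_bytes are made up. It holds UTF-16 in char16_t code units, whose width does not vary between compilers the way wchar_t does, and it spells out the byte order when serializing instead of relying on the host's endianness.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: serialize UTF-16 code units with an explicit byte order,
// so sender and receiver never depend on the endianness of the host.
std::vector<std::uint8_t> to_utf16_bytes(const std::u16string& s, bool big_endian) {
    std::vector<std::uint8_t> out;
    out.reserve(s.size() * 2);
    for (char16_t unit : s) {
        const std::uint8_t hi = static_cast<std::uint8_t>((unit >> 8) & 0xFF);
        const std::uint8_t lo = static_cast<std::uint8_t>(unit & 0xFF);
        if (big_endian) { out.push_back(hi); out.push_back(lo); }  // UTF-16BE
        else            { out.push_back(lo); out.push_back(hi); }  // UTF-16LE
    }
    assert(out.size() == s.size() * 2);  // exactly two bytes per code unit
    return out;
}

// sizeof(wchar_t) is implementation-defined: typically 2 on Windows
// and 4 on Linux and macOS.
constexpr std::size_t wide_char_bytes() { return sizeof(wchar_t); }
```

Calling to_utf16_bytes(text, true) and to_utf16_bytes(text, false) on the same string yields the same code units with each byte pair swapped, which is exactly the ambiguity a wire format has to pin down.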

When you are in platform-specific code, it makes sense to use the platform's native representation to communicate with its APIs. But for any code that is shared across platforms, or communicates between platforms, avoid all ambiguity and use UTF-8.
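
One way to follow this advice is to keep UTF-8 in std::string throughout the shared code and convert only at the platform boundary. The following is a minimal sketch in portable C++11, not a production decoder: the function name utf8_to_utf16 is hypothetical, and validation is deliberately basic (it rejects bad lead or continuation bytes and truncated input, but not overlong encodings or encoded surrogates).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical boundary helper: decode UTF-8 and re-encode as UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow the lead byte
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF) throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));  // BMP: one code unit
        } else {
            cp -= 0x10000;  // outside the BMP: split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits, the resulting code units could be copied into a std::wstring before calling a wide API; on other platforms the UTF-8 string can usually be passed through unchanged.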

半﹌身腐败 2024-08-24 13:05:39

Rule of thumb: use the native platform Unicode form for processing (UTF-16 or UTF-32), and UTF-8 for data interchange (communication, storage).

If all the native APIs use UTF-16 (for instance in Windows), having your strings as UTF-8 means you will have to convert all input to UTF-16, call the Win API, then convert the answer to UTF-8. Quite a pain.

But if the main problem is the UI, the strings are the simple part. The more difficult part is the UI framework.
For that I would recommend wxWidgets (http://www.wxWidgets.org). It supports many platforms, is mature (17 years old and still very active), and offers native widgets, Unicode support, and a liberal license.

清眉祭 2024-08-24 13:05:39

I'd go for a UTF-16 representation in memory and UTF-8 or UTF-16 on hard disk or on the wire. The main reason: UTF-16 has a fixed size for each "letter", at least within the Basic Multilingual Plane (see the clarification below). This simplifies a lot of tasks when working with strings (searching, replacing parts, ...).

The only reason for UTF-8 is the reduced memory usage for "Western/Latin" letters. You can use this representation for disk storage or for transmission over the network. It also has the benefit that you need not worry about byte order when loading from or saving to disk or wire.

With these reasons in mind, I'd go for std::wstring internally or, if your GUI library offers a wide string type, use that (like QString from Qt). For disk storage, I'd write a small platform-independent wrapper around the platform API, or I'd check whether unicode.org has platform-independent code available for this conversion.


For clarification: Korean and Japanese letters are NOT Western/Latin. Japanese, for example, uses Kanji. That's why I mentioned the Latin character set.


Regarding UTF-16 not being 1 character / 2 bytes: that assumption only holds for characters in the Basic Multilingual Plane (BMP) (see: http://en.wikipedia.org/wiki/UTF16). Still, most users of UTF-16 assume that all characters are in the BMP. If this can't be guaranteed for your application, you can switch to UTF-32 or to UTF-8.
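
To illustrate the point about the BMP, here is a short C++11 sketch (the function name count_code_points is invented for this example, and well-formed input is assumed). A code point outside the BMP takes two UTF-16 code units, a surrogate pair, so the length of the string in code units is not the number of characters.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-16 string (assumes well-formed input).
// A code point outside the BMP occupies two code units (a surrogate pair),
// so size() alone over-counts such strings.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        // Trailing (low) surrogates 0xDC00..0xDFFF belong to the previous code point.
        if (unit < 0xDC00 || unit > 0xDFFF) ++n;
    }
    return n;
}
```

For a string made of "a" followed by U+1F600, size() reports three code units while the function above reports two code points; code that indexes or truncates by code unit can therefore split a pair in half.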

Still, UTF-16 is used for the reasons mentioned above in a lot of APIs (e.g. Windows, Qt, Java, .NET, wxWidgets).

云仙小弟 2024-08-24 13:05:39

You can store UTF-16 inside std::string. So in principle you could use std::string for all platforms, and store inside the encoding preferred by the platform (UTF-8 for Linux, UTF-16 for Windows, etc.). This will leave you with something simple at the C++ types level, but having to track the encoding of strings. This may be simple if the application is self-contained, and less simple if it has to interoperate (cf. storage, wire format).

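The risk of storing UTF-16 inside std::string is that sooner or later you will call .c_str() and the result will be interpreted as ending at the first 0 byte, which for the UTF-16 bytes of L"hello" on a little-endian machine is at index 1. In fact, std::string s = reinterpret_cast<const char *>(L"hello") is already truncated by the constructor and holds only "h".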
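
The truncation is easy to reproduce. The sketch below is only an illustration with made-up helper names (pack_utf16le, length_seen_by_c_api): it packs UTF-16LE code units into a std::string as raw bytes and then measures the length a C API would see through c_str().

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: pack UTF-16LE code units into a std::string as raw bytes.
std::string pack_utf16le(const std::u16string& s) {
    std::string bytes;
    for (char16_t unit : s) {
        bytes.push_back(static_cast<char>(unit & 0xFF));         // low byte first
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    }
    return bytes;
}

// What a C-style API sees: it stops reading at the first zero byte.
std::size_t length_seen_by_c_api(const std::string& bytes) {
    return std::strlen(bytes.c_str());
}
```

For u"hello" the std::string holds 10 bytes, but length_seen_by_c_api returns 1, because the high byte of 'h' is zero. Any function that takes a const char * will silently see a one-character string.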
