WChars, Encodings, Standards and Portability

Posted on 2024-11-14 12:38:08

The following may not qualify as an SO question; if it is out of bounds, please feel free to tell me to go away. The question here is basically, "Do I understand the C standard correctly and is this the right way to go about things?"

I would like to ask for clarification, confirmation and corrections on my understanding of character handling in C (and thus C++ and C++0x). First off, an important observation:

Portability and serialization are orthogonal concepts.

Portable things are things like C, unsigned int, wchar_t. Serializable things are things like uint32_t or UTF-8. "Portable" means that you can recompile the same source and get a working result on every supported platform, but the binary representation may be totally different (or not even exist, e.g. TCP-over-carrier pigeon). Serializable things on the other hand always have the same representation, e.g. the PNG file I can read on my Windows desktop, on my phone or on my toothbrush. Portable things are internal, serializable things deal with I/O. Portable things are typesafe, serializable things need type punning.

When it comes to character handling in C, there are two groups of things related respectively to portability and serialization:

  • wchar_t, setlocale(), mbsrtowcs()/wcsrtombs(): The C standard says nothing about "encodings"; in fact, it is entirely agnostic to any text or encoding properties. It only says "your entry point is main(int, char**); you get a type wchar_t which can hold all your system's characters; you get functions to read input char-sequences and make them into workable wstrings and vice versa."

  • iconv() and UTF-8,16,32: A function/library to transcode between well-defined, definite, fixed encodings. All encodings handled by iconv are universally understood and agreed upon, with one exception.

The bridge between the portable, encoding-agnostic world of C with its wchar_t portable character type and the deterministic outside world is iconv conversion between WCHAR_T and UTF.
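To make that bridge concrete, here is a minimal sketch of one direction of it: a wide string is handed to iconv and comes out as UTF-8. The function name to_utf8 is hypothetical. The sketch assumes an iconv implementation that accepts the encoding name "WCHAR_T" (glibc and GNU libiconv do; POSIX does not standardize encoding names) and whose iconv() takes a char** input pointer.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a wide string to UTF-8 via iconv.
// Assumption: the iconv implementation knows the name "WCHAR_T".
std::string to_utf8(const std::wstring& s)
{
    // iconv_open takes (tocode, fromcode), in that order.
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == (iconv_t)(-1))
        throw std::runtime_error("iconv_open failed");

    // Four output bytes per wchar_t is enough for both UTF-16 and UTF-32 input.
    std::string out(s.size() * 4 + 1, '\0');

    // iconv() does not modify the input, but its interface is not const-correct.
    char* src = reinterpret_cast<char*>(const_cast<wchar_t*>(s.data()));
    std::size_t src_left = s.size() * sizeof(wchar_t);
    char* dst = &out[0];
    std::size_t dst_left = out.size();

    const std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1))
        throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - dst_left);  // keep only the bytes actually written
    return out;
}
```

Unlike mbsrtowcs()/wcsrtombs(), this does not depend on the current locale; the two encoding names fully determine the conversion.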

So, should I always store my strings internally in an encoding-agnostic wstring, interface with the CRT via wcsrtombs(), and use iconv() for serialization? Conceptually:

                        my program
    <-- wcstombs ---  /==============\   --- iconv(UTF8, WCHAR_T) -->
CRT                   |   wchar_t[]  |                                <Disk>
    --- mbstowcs -->  \==============/   <-- iconv(WCHAR_T, UTF8) ---
                            |
                            +-- iconv(WCHAR_T, UCS-4) --+
                                                        |
       ... <--- (adv. Unicode malarkey) ----- libicu ---+

Practically, that means that I'd write two boilerplate wrappers for my program entry point, e.g. for C++:

// Portable wmain()-wrapper
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

std::vector<std::wstring> parse(int argc, char * argv[]); // use mbsrtowcs etc

int wmain(const std::vector<std::wstring> args); // user starts here

#if defined(_WIN32) || defined(WIN32)
#include <windows.h>
extern "C" int main()
{
  setlocale(LC_CTYPE, "");
  int argc;
  wchar_t * const * const argv = CommandLineToArgvW(GetCommandLineW(), &argc);
  return wmain(std::vector<std::wstring>(argv, argv + argc));
}
#else
extern "C" int main(int argc, char * argv[])
{
  setlocale(LC_CTYPE, "");
  return wmain(parse(argc, argv));
}
#endif

// Serialization utilities

#include <cstdint>  // uint16_t, uint32_t
#include <string>

#include <iconv.h>

typedef std::basic_string<uint16_t> U16String;
typedef std::basic_string<uint32_t> U32String;

U16String toUTF16(std::wstring s);
U32String toUTF32(std::wstring s);

/* ... */

Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++, together with a well-defined I/O interface to UTF using iconv? (Note that issues like Unicode normalization or diacritic replacement are outside the scope; only after you decide that you actually want Unicode (as opposed to any other coding system you might fancy) is it time to deal with those specifics, e.g. using a dedicated library like libicu.)
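For completeness, the parse() helper declared above could look roughly like the following. This is only a sketch under the assumption that setlocale(LC_CTYPE, "") has already been called, so that mbsrtowcs() interprets argv in the user's multibyte encoding; widen() is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert one multibyte string, interpreted in the
// current LC_CTYPE encoding, into a wide string.
std::wstring widen(const char* s)
{
    std::mbstate_t state = std::mbstate_t();
    const char* src = s;

    // With a null destination, mbsrtowcs only reports the required length
    // (excluding the terminator) and leaves src untouched.
    const std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");

    std::vector<wchar_t> buf(len + 1);
    state = std::mbstate_t();
    src = s;
    std::mbsrtowcs(buf.data(), &src, buf.size(), &state);
    return std::wstring(buf.data(), len);
}

// Convert every command-line argument to a wide string.
std::vector<std::wstring> parse(int argc, char* argv[])
{
    std::vector<std::wstring> args;
    for (int i = 0; i < argc; ++i)
        args.push_back(widen(argv[i]));
    return args;
}
```

The two-pass pattern (measure, then convert) avoids guessing a buffer size, since the number of wide characters per input byte depends on the locale's encoding.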

Updates

Following many very nice comments I'd like to add a few observations:

  • If your application explicitly wants to deal with Unicode text, you should make the iconv-conversion part of the core and use uint32_t/char32_t-strings internally with UCS-4.

  • Windows: While using wide strings is generally fine, it appears that interaction with the console (any console, for that matter) is limited, as there does not appear to be support for any sensible multi-byte console encoding and mbstowcs is essentially useless (other than for trivial widening). Receiving wide-string arguments from, say, an Explorer-drop together with GetCommandLineW+CommandLineToArgvW works (perhaps there should be a separate wrapper for Windows).

  • File systems: File systems don't seem to have any notion of encoding and simply take any null-terminated string as a file name. Most systems take byte strings, but Windows/NTFS takes 16-bit strings. You have to take care when discovering which files exist and when handling that data (e.g. char16_t sequences that do not constitute valid UTF-16, such as naked surrogates, are valid NTFS filenames). The Standard C fopen is not able to open all NTFS files, since there is no possible conversion that will map to all possible 16-bit strings. Use of the Windows-specific _wfopen may be required. As a corollary, there is in general no well-defined notion of "how many characters" comprise a given file name, as there is no notion of "character" in the first place. Caveat emptor.
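As an illustration of the "naked surrogate" caveat, here is a small sketch that checks whether a sequence of 16-bit units is well-formed UTF-16 before it is handed to a transcoder. The function name is_well_formed_utf16 is made up for this example; a filename for which it returns false can still be a perfectly valid NTFS name.

```cpp
#include <cstddef>
#include <string>

// Returns true if every high surrogate (0xD800..0xDBFF) is immediately
// followed by a low surrogate (0xDC00..0xDFFF) and no low surrogate
// appears on its own.
bool is_well_formed_utf16(const std::u16string& name)
{
    for (std::size_t i = 0; i < name.size(); ++i) {
        const char16_t unit = name[i];
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            // A high surrogate needs a low surrogate right after it.
            if (i + 1 == name.size())
                return false;
            const char16_t next = name[i + 1];
            if (next < 0xDC00 || next > 0xDFFF)
                return false;
            ++i;  // skip the low half of the pair
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            return false;  // low surrogate without a preceding high surrogate
        }
    }
    return true;
}
```

A program that lists directories would have to decide what to do with names that fail this check, since they cannot be converted to UTF-8 or UTF-32 without loss.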

Comments (4)

云胡 2024-11-21 12:38:08

Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++

No, and there is no way at all to fulfill all these properties, at least if you want your program to run on Windows. On Windows, you have to ignore the C and C++ standards almost everywhere and work exclusively with wchar_t (not necessarily internally, but at all interfaces to the system). For example, if you start with

int main(int argc, char** argv)

you have already lost Unicode support for command line arguments. You have to write

int wmain(int argc, wchar_t** argv)

instead, or use the GetCommandLineW function, neither of which is specified in the C standard.

More specifically,

  • Any Unicode-capable program on Windows must actively ignore the C and C++ standards for things like command line arguments, file and console I/O, or file and directory manipulation. This is certainly not idiomatic. Use the Microsoft extensions or wrappers like Boost.Filesystem or Qt instead.
  • Portability is extremely hard to achieve, especially for Unicode support. You really have to be prepared that everything you think you know is possibly wrong. For example, you have to consider that the filenames you use to open files can be different from the filenames that are actually used, and that two seemingly different filenames may represent the same file. After you create two files a and b, you might end up with a single file c, or two files d and e, whose filenames are different from the filenames you passed to the OS. Either you need an external wrapper library or lots of #ifdefs.
  • Encoding agnosticism usually just doesn't work in practice, especially if you want to be portable. You have to know that wchar_t is a UTF-16 code unit on Windows and that char is often (but not always) a UTF-8 code unit on Linux. Encoding-awareness is often the more desirable goal: make sure that you always know which encoding you are working with, or use a wrapper library that abstracts them away.

I think I have to conclude that it's completely impossible to build a portable Unicode-capable application in C or C++ unless you are willing to use additional libraries and system-specific extensions, and to put lots of effort into it. Unfortunately, most applications already fail at comparatively simple tasks such as "writing Greek characters to the console" or "supporting any filename allowed by the system in a correct manner", and such tasks are only the first tiny steps towards true Unicode support.

浊酒尽余欢 2024-11-21 12:38:08

I would avoid the wchar_t type because it's platform-dependent (not "serializable" by your definition): UTF-16 on Windows and UTF-32 on most Unix-like systems. Instead, use the char16_t and/or char32_t types from C++0x/C1x. (If you don't have a new compiler, typedef them as uint16_t and uint32_t for now.)

DO define functions to convert between UTF-8, UTF-16, and UTF-32.

DON'T write overloaded narrow/wide versions of every string function like the Windows API did with -A and -W. Pick one preferred encoding to use internally, and stick to it. For things that need a different encoding, convert as necessary.
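As a rough sketch of what such conversion functions might look like, assuming UTF-32 is the encoding picked for internal use, here are two hand-written encoders. The names utf32_to_utf8 and utf32_to_utf16 are made up for this example, and a real program may prefer a library implementation.

```cpp
#include <stdexcept>
#include <string>

// True for code points that are valid Unicode scalar values.
inline bool is_scalar_value(char32_t c)
{
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Encode UTF-32 as UTF-8 (1 to 4 bytes per code point).
std::string utf32_to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t c : in) {
        if (!is_scalar_value(c))
            throw std::invalid_argument("not a Unicode scalar value");
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Encode UTF-32 as UTF-16 (surrogate pairs above U+FFFF).
std::u16string utf32_to_utf16(const std::u32string& in)
{
    std::u16string out;
    for (char32_t c : in) {
        if (!is_scalar_value(c))
            throw std::invalid_argument("not a Unicode scalar value");
        if (c < 0x10000) {
            out += static_cast<char16_t>(c);
        } else {
            const char32_t v = c - 0x10000;  // 20 bits, split 10/10
            out += static_cast<char16_t>(0xD800 + (v >> 10));
            out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));
        }
    }
    return out;
}
```

With one internal encoding, each external interface needs only a pair of such functions at its boundary, rather than parallel narrow and wide versions of everything.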

雨夜星沙 2024-11-21 12:38:08

The problem with wchar_t is that encoding-agnostic text processing is too difficult and should be avoided. If you stick with "pure C" as you say, you can use all of the w* functions like wcscat and friends, but if you want to do anything more sophisticated then you have to dive into the abyss.

Here are some things that are much harder with wchar_t than they are if you just pick one of the UTF encodings:

  • Parsing JavaScript: Identifiers can contain certain characters outside the BMP (and let's assume that you care about this kind of correctness).

  • HTML: How do you turn the numeric character reference &#65536; (U+10000, 𐀀) into a string of wchar_t?

  • Text editor: How do you find grapheme cluster boundaries in a wchar_t string?
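To illustrate the HTML case, here is a sketch of what appending a single code point to a wstring ends up looking like. It only works by assuming something the standard does not promise, namely that a 16-bit wchar_t holds UTF-16 and a wider one holds UTF-32; append_code_point is a hypothetical name.

```cpp
#include <stdexcept>
#include <string>

// Append Unicode code point `cp` to `out`.
// Assumption (not guaranteed by the C or C++ standard): wchar_t is a UTF-16
// code unit when it is 16 bits wide and a UTF-32 code unit otherwise.
void append_code_point(std::wstring& out, unsigned long cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    if (sizeof(wchar_t) == 2 && cp >= 0x10000) {
        // Does not fit in one 16-bit unit: emit a surrogate pair.
        cp -= 0x10000;
        out += static_cast<wchar_t>(0xD800 + (cp >> 10));
        out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
    } else {
        out += static_cast<wchar_t>(cp);
    }
}
```

The branch on sizeof(wchar_t) is exactly the kind of encoding knowledge that "encoding-agnostic" code was supposed to avoid.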

If I know the encoding of a string, I can examine the characters directly. If I don't know the encoding, I have to hope that whatever I want to do with a string is implemented by a library function somewhere. So the portability of wchar_t is somewhat irrelevant as I don't consider it an especially useful data type.

Your program requirements may differ and wchar_t may work fine for you.

巷子口的你 2024-11-21 12:38:08

Given that iconv is not "pure standard C/C++", I don't think you are satisfying your own specifications.

New codecvt facets are coming with char32_t and char16_t, so I don't see how you can go wrong as long as you are consistent and pick one character type plus encoding, provided the facets are available.

The facets are described in 22.5 [locale.stdcvt] (from n3242).


I don't understand how this doesn't satisfy at least some of your requirements:

#include <codecvt>  // std::codecvt_utf8, std::codecvt_utf16 (deprecated since C++17)
#include <locale>   // std::wstring_convert
#include <string>

namespace ns {

typedef char32_t char_t;
typedef std::u32string string;

// or use a user-defined literal
#define LIT(x) U##x  // LIT('A'), LIT("Hello, World!")

// Communicate with interface0, which wants UTF-8

// This type doesn't need to be public at all; I just refactored it.
typedef std::wstring_convert<std::codecvt_utf8<char_t>, char_t> converter0;

inline std::string
to_interface0(string const& s)
{
    return converter0().to_bytes(s);
}

inline string
from_interface0(std::string const& s)
{
    return converter0().from_bytes(s);
}

// Communicate with interface1, which wants UTF-16

// Doesn't have to be public either
typedef std::wstring_convert<std::codecvt_utf16<char_t>, char_t> converter1;

// to_bytes() yields the UTF-16 code units serialized as bytes (big-endian by default)
inline std::string
to_interface1(string const& s)
{
    return converter1().to_bytes(s);
}

inline string
from_interface1(std::string const& s)
{
    return converter1().from_bytes(s);
}

} // ns

Then your code can use ns::string, ns::char_t, and LIT-wrapped character and string literals with reckless abandon, without knowing what the underlying representation is. Then use from_interfaceX(some_string) whenever it's needed. It doesn't affect the global locale or streams either. The helpers can be as clever as needed, e.g. codecvt_utf8 can deal with 'headers', which I assume is Standardese for tricky stuff like the BOM (ditto codecvt_utf16).

In fact I wrote the above to be as short as possible but you'd really want helpers like this:

#include <utility>  // std::forward

namespace ns {

template<typename... T>
inline string
from_interface0(T&&... t)
{
    return converter0().from_bytes(std::forward<T>(t)...);
}

} // ns

which gives you access to all the overloads of the [from|to]_bytes members, accepting things such as a const char* or a pointer range.
