WChars, Encodings, Standards and Portability
The following may not qualify as an SO question; if it is out of bounds, please feel free to tell me to go away. The question here is basically, "Do I understand the C standard correctly and is this the right way to go about things?"
I would like to ask for clarification, confirmation and corrections on my understanding of character handling in C (and thus C++ and C++0x). First off, an important observation:
Portability and serialization are orthogonal concepts.
Portable things are things like C, unsigned int, wchar_t. Serializable things are things like uint32_t or UTF-8. "Portable" means that you can recompile the same source and get a working result on every supported platform, but the binary representation may be totally different (or not even exist, e.g. TCP-over-carrier pigeon). Serializable things on the other hand always have the same representation, e.g. the PNG file I can read on my Windows desktop, on my phone or on my toothbrush. Portable things are internal, serializable things deal with I/O. Portable things are typesafe, serializable things need type punning. </preamble>
When it comes to character handling in C, there are two groups of things related respectively to portability and serialization:
wchar_t, setlocale(), mbsrtowcs()/wcsrtombs(): The C standard says nothing about "encodings"; in fact, it is entirely agnostic to any text or encoding properties. It only says "your entry point is main(int, char**); you get a type wchar_t which can hold all your system's characters; you get functions to read input char-sequences and make them into workable wstrings and vice versa."
iconv() and UTF-8,16,32: A function/library to transcode between well-defined, definite, fixed encodings. All encodings handled by iconv are universally understood and agreed upon, with one exception.
The bridge between the portable, encoding-agnostic world of C with its wchar_t portable character type and the deterministic outside world is iconv conversion between WCHAR_T and UTF.
So, should I always store my strings internally in an encoding-agnostic wstring, interface with the CRT via wcsrtombs(), and use iconv() for serialization? Conceptually:
                        my program
    <-- wcstombs ---  /==============\   --- iconv(UTF8, WCHAR_T) -->
CRT                   |   wchar_t[]  |                                <Disk>
    --- mbstowcs -->  \==============/   <-- iconv(WCHAR_T, UTF8) ---
                            |
                            +-- iconv(WCHAR_T, UCS-4) --+
                                                        |
       ... <--- (adv. Unicode malarkey) ----- libicu ---+
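The right-hand leg of the diagram, from wchar_t[] to UTF-8 on disk, might be sketched roughly as follows. This is only an illustration under assumptions: it relies on the POSIX iconv interface, on the glibc-style prototype taking char** for the input buffer, and on the implementation accepting the encoding name "WCHAR_T" (glibc and GNU libiconv do). The helper name wide_to_utf8 is made up for this example.

```cpp
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the question): serialize a wide string as UTF-8.
std::string wide_to_utf8(const std::wstring& in)
{
    // iconv_open takes the target encoding first, then the source encoding.
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open(UTF-8, WCHAR_T) failed");

    // A single wchar_t never expands to more than 4 UTF-8 bytes.
    std::string out(in.size() * 4 + 1, '\0');
    std::wstring copy(in);                        // iconv wants a non-const input pointer
    char* src = reinterpret_cast<char*>(&copy[0]);
    std::size_t src_left = copy.size() * sizeof(wchar_t);
    char* dst = &out[0];
    std::size_t dst_left = out.size();

    std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1))
        throw std::runtime_error("input could not be converted to UTF-8");

    out.resize(out.size() - dst_left);            // keep only the bytes actually written
    return out;
}
```

The opposite direction would swap the two names passed to iconv_open and size the output buffer in units of wchar_t.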
Practically, that means that I'd write two boiler-plate wrappers for my program entry point, e.g. for C++:
// Portable wmain()-wrapper
#include <clocale>
#include <cstdint>   // uint16_t, uint32_t used by the typedefs below
#include <cwchar>
#include <string>
#include <vector>

std::vector<std::wstring> parse(int argc, char * argv[]); // use mbsrtowcs etc
int wmain(const std::vector<std::wstring> & args);        // user starts here

#if defined(_WIN32) || defined(WIN32)
#include <windows.h>
extern "C" int main()
{
    setlocale(LC_CTYPE, "");
    int argc;
    // Ask Windows for the wide command line instead of using the narrow argv.
    wchar_t * const * const argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    return wmain(std::vector<std::wstring>(argv, argv + argc));
}
#else
extern "C" int main(int argc, char * argv[])
{
    setlocale(LC_CTYPE, "");   // pick up the user's locale before converting argv
    return wmain(parse(argc, argv));
}
#endif

// Serialization utilities
#include <iconv.h>
typedef std::basic_string<uint16_t> U16String;
typedef std::basic_string<uint32_t> U32String;
U16String toUTF16(std::wstring s);
U32String toUTF32(std::wstring s);
/* ... */
Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++, together with a well-defined I/O interface to UTF using iconv? (Note that issues like Unicode normalization or diacritic replacement are outside the scope; only after you decide that you actually want Unicode (as opposed to any other coding system you might fancy) is it time to deal with those specifics, e.g. using a dedicated library like libicu.)
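The question only declares parse() and leaves its body open. A minimal sketch of how it could be written with mbsrtowcs is shown here; the helper widen() is a made-up name, and the code assumes that setlocale(LC_CTYPE, "") has already been called so that argv is interpreted in the user's locale.

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert one locale-encoded multibyte string to a wide string.
std::wstring widen(const char* s)
{
    std::mbstate_t state = std::mbstate_t();
    const char* src = s;
    // With a null destination, mbsrtowcs only reports the required length.
    std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for the current locale");

    std::wstring out(len, L'\0');
    src = s;
    state = std::mbstate_t();
    if (len > 0)
        std::mbsrtowcs(&out[0], &src, len, &state);   // second pass does the conversion
    return out;
}

// One possible body for the parse() declared in the question.
std::vector<std::wstring> parse(int argc, char* argv[])
{
    std::vector<std::wstring> args;
    for (int i = 0; i < argc; ++i)
        args.push_back(widen(argv[i]));
    return args;
}
```

Throwing on an undecodable argument is a design choice of this sketch; a real program might prefer to substitute a replacement character or keep the raw bytes around.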
Updates
Following many very nice comments I'd like to add a few observations:
If your application explicitly wants to deal with Unicode text, you should make the iconv-conversion part of the core and use uint32_t/char32_t-strings internally with UCS-4.
Windows: While using wide strings is generally fine, it appears that interaction with the console (any console, for that matter) is limited, as there does not appear to be support for any sensible multi-byte console encoding and mbstowcs is essentially useless (other than for trivial widening). Receiving wide-string arguments from, say, an Explorer-drop together with GetCommandLineW+CommandLineToArgvW works (perhaps there should be a separate wrapper for Windows).
File systems: File systems don't seem to have any notion of encoding and simply take any null-terminated string as a file name. Most systems take byte strings, but Windows/NTFS takes 16-bit strings. You have to take care when discovering which files exist and when handling that data (e.g. char16_t sequences that do not constitute valid UTF16 (e.g. naked surrogates) are valid NTFS filenames). The Standard C fopen is not able to open all NTFS files, since there is no possible conversion that will map to all possible 16-bit strings. Use of the Windows-specific _wfopen may be required. As a corollary, there is in general no well defined notion of "how many characters" comprise a given file name, as there is no notion of "character" in the first place. Caveat emptor.
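The file-system point could be wrapped up along these lines. This is a sketch, not a complete solution: open_wide is a made-up name, the Windows branch uses the Microsoft-specific _wfopen, and the other branch assumes that file names are byte strings in the current locale's encoding.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <cwchar>
#include <string>

// Hypothetical wrapper: open a file whose name is held as a wide string.
std::FILE* open_wide(const std::wstring& path, const char* mode)
{
#if defined(_WIN32)
    // Windows: hand the 16-bit name through untouched, so that even names
    // that are not valid UTF-16 can be opened.
    std::wstring wmode(mode, mode + std::strlen(mode));   // plain ASCII widening
    return _wfopen(path.c_str(), wmode.c_str());
#else
    // Elsewhere: convert the name to a byte string using the current locale.
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = path.c_str();
    std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        return nullptr;   // name is not representable in the locale's encoding

    std::string narrow(len, '\0');
    src = path.c_str();
    state = std::mbstate_t();
    if (len > 0)
        std::wcsrtombs(&narrow[0], &src, len, &state);
    return std::fopen(narrow.c_str(), mode);
#endif
}
```

Note that the non-Windows branch has the mirror-image limitation: a file whose name is not valid in the current locale cannot be reached through a wide string at all.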
No, and there is no way at all to fulfill all these properties, at least if you want your program to run on Windows. On Windows, you have to ignore the C and C++ standards almost everywhere and work exclusively with wchar_t (not necessarily internally, but at all interfaces to the system). For example, if you start with the standard main(int, char**) entry point, you have already lost Unicode support for command line arguments. You have to write a wide-character entry point instead, or use the GetCommandLineW function, none of which is specified in the C standard.
More specifically, you have to know that wchar_t is a UTF-16 code unit on Windows and that char is often (but not always) a UTF-8 code unit on Linux. Encoding-awareness is often the more desirable goal: make sure that you always know with which encoding you work, or use a wrapper library that abstracts them away.
I think I have to conclude that it's completely impossible to build a portable Unicode-capable application in C or C++ unless you are willing to use additional libraries and system-specific extensions, and to put lots of effort in it. Unfortunately, most applications already fail at comparatively simple tasks such as "writing Greek characters to the console" or "supporting any filename allowed by the system in a correct manner", and such tasks are only the first tiny steps towards true Unicode support.
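As one small illustration of encoding-awareness: since char strings on Linux are often, but not always, UTF-8, a program can check that assumption before relying on it. The function below is a sketch with a made-up name (is_valid_utf8); it rejects truncated sequences, overlong forms, surrogates and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical check: does this byte string consist of well-formed UTF-8?
bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (b < 0x80) { ++i; continue; }           // plain ASCII byte

        std::size_t len;      // total length of the sequence
        unsigned long cp;     // code point being assembled
        unsigned long min;    // smallest value allowed for this length
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;                         // stray continuation or invalid lead byte

        if (n - i < len) return false;             // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                        // overlong encoding
        if (cp > 0x10FFFF) return false;                   // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // surrogate code point
        i += len;
    }
    return true;
}
```

A Latin-1 encoded "é" (the single byte 0xE9) fails this check, whereas the two-byte UTF-8 form of a Greek alpha passes.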
I would avoid the wchar_t type because it's platform-dependent (not "serializable" by your definition): UTF-16 on Windows and UTF-32 on most Unix-like systems. Instead, use the char16_t and/or char32_t types from C++0x/C1x. (If you don't have a new compiler, typedef them as uint16_t and uint32_t for now.)
DO define functions to convert between UTF-8, UTF-16, and UTF-32.
DON'T write overloaded narrow/wide versions of every string function like the Windows API did with -A and -W. Pick one preferred encoding to use internally, and stick to it. For things that need a different encoding, convert as necessary.
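Conversion functions of the kind recommended here are short to write once UTF-32 is picked as the internal form. The following is a sketch assuming a C++11 compiler; the names utf32_to_utf8, utf32_to_utf16 and check_scalar are invented for this example, and rejecting invalid input by throwing is just one possible policy.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: refuse anything that is not a Unicode scalar value.
static void check_scalar(char32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
}

std::string utf32_to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        check_scalar(cp);
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

std::u16string utf32_to_utf16(const std::u32string& in)
{
    std::u16string out;
    for (char32_t cp : in) {
        check_scalar(cp);
        if (cp < 0x10000) {                    // BMP: one code unit
            out += static_cast<char16_t>(cp);
        } else {                               // beyond the BMP: surrogate pair
            const char32_t v = cp - 0x10000;
            out += static_cast<char16_t>(0xD800 + (v >> 10));
            out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));
        }
    }
    return out;
}
```

The decoding directions follow the same bit layouts in reverse and need the validation shown in any careful UTF-8 decoder (overlong forms, unpaired surrogates).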
The problem with wchar_t is that encoding-agnostic text processing is too difficult and should be avoided. If you stick with "pure C" as you say, you can use all of the w* functions like wcscat and friends, but if you want to do anything more sophisticated then you have to dive into the abyss.
Here are some things that are much harder with wchar_t than they are if you just pick one of the UTF encodings:
Parsing Javascript: Identifiers can contain certain characters outside the BMP (and let's assume that you care about this kind of correctness).
HTML: How do you turn the character reference &#65536; (U+10000) into a string of wchar_t?
Text editor: How do you find grapheme cluster boundaries in a wchar_t string?
If I know the encoding of a string, I can examine the characters directly. If I don't know the encoding, I have to hope that whatever I want to do with a string is implemented by a library function somewhere. So the portability of wchar_t is somewhat irrelevant as I don't consider it an especially useful data type.
Your program requirements may differ and wchar_t may work fine for you.
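To make the HTML example concrete, here is roughly what turning a decimal character reference into wchar_t involves. The sketch has to assume an encoding after all: UTF-16 when wchar_t is 16 bits wide and UTF-32 otherwise, which matches Windows and most Unix-like systems but is not promised by the standard. Both function names are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append one Unicode code point to a wide string,
// assuming UTF-16 for 16-bit wchar_t and UTF-32 for wider wchar_t.
void append_code_point(std::wstring& out, unsigned long cp)
{
    if (sizeof(wchar_t) == 2 && cp >= 0x10000) {
        const unsigned long v = cp - 0x10000;
        out += static_cast<wchar_t>(0xD800 + (v >> 10));     // high surrogate
        out += static_cast<wchar_t>(0xDC00 + (v & 0x3FF));   // low surrogate
    } else {
        out += static_cast<wchar_t>(cp);
    }
}

// Decode a decimal numeric character reference such as "&#65536;" and append
// the result to out. Returns false if the text does not have that form or
// does not name a valid code point.
bool decode_char_ref(const std::string& ref, std::wstring& out)
{
    if (ref.size() < 4 || ref[0] != '&' || ref[1] != '#' || ref[ref.size() - 1] != ';')
        return false;
    unsigned long cp = 0;
    for (std::size_t i = 2; i + 1 < ref.size(); ++i) {
        if (ref[i] < '0' || ref[i] > '9') return false;
        cp = cp * 10 + static_cast<unsigned long>(ref[i] - '0');
        if (cp > 0x10FFFF) return false;           // beyond Unicode
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogates are not characters
    append_code_point(out, cp);
    return true;
}
```

The sizeof(wchar_t) branch is exactly the kind of platform knowledge that an encoding-agnostic program was supposed to avoid, which is the point the answer is making.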
Given that iconv is not "pure standard C/C++", I don't think you are satisfying your own specifications.
There are new codecvt facets coming with char32_t and char16_t so I don't see how you can be wrong as long as you are consistent and pick one char type + encoding if the facets are here.
The facets are described in 22.5 [locale.stdcvt] (from n3242).
I don't understand how this doesn't satisfy at least some of your requirements:
[code listing not preserved]
Then your code can use ns::string, ns::char_t, LIT'A' & LIT"Hello, World!" with reckless abandon, without knowing what's the underlying representation. Then use from_interfaceX(some_string) whenever it's needed. It doesn't affect the global locale or streams either. The helpers can be as clever as needed, e.g. codecvt_utf8 can deal with 'headers', which I assume is Standardese for tricky stuff like the BOM (ditto codecvt_utf16).
In fact I wrote the above to be as short as possible but you'd really want helpers like this:
[code listing not preserved]
which give you access to the 3 overloads for each [from|to]_bytes members, accepting things like e.g. const char* or ranges.
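The listings that accompanied this answer did not survive, so the following is not the answerer's code. It is a minimal sketch of the kind of helper the text describes, assuming char32_t as the internal character type and UTF-8 as the external interface, and using std::wstring_convert with std::codecvt_utf8 from [locale.stdcvt]. The names ns::to_utf8 and ns::from_utf8 are made up, and both library facilities were later deprecated in C++17, so a compiler may warn about them.

```cpp
#include <codecvt>
#include <locale>
#include <string>

namespace ns {

typedef char32_t char_t;            // the one internal character type
typedef std::u32string string;      // the one internal string type

// Converter between the internal UTF-32 strings and UTF-8 byte strings.
typedef std::wstring_convert<std::codecvt_utf8<char_t>, char_t> utf8_converter;

// Internal string -> UTF-8 bytes, for an interface that wants UTF-8.
inline std::string to_utf8(const string& s)
{
    return utf8_converter().to_bytes(s);
}

// UTF-8 bytes -> internal string; throws std::range_error on malformed input.
inline string from_utf8(const std::string& s)
{
    return utf8_converter().from_bytes(s);
}

} // namespace ns
```

Code written against ns::string then only touches a concrete encoding at the boundary, for example ns::from_utf8(line_read_from_file), and neither the global locale nor the stream locales are modified.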