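Why isn't UTF-8 allowed as the "ANSI" code page?
The Windows _setmbcp function allows any valid code page... (except UTF-7 and UTF-8, which are not supported).
OK, not supporting UTF-7 makes sense: characters have non-unique representations, which introduces complexity and security risks.
But why not UTF-8?
As I understand it, the "ANSI" versions of Windows API functions convert their arguments to UTF-16, call the equivalent "W" function, and convert any strings in the output back to "ANSI". That is exactly what I have been doing manually. So why can't Windows do it for me?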
Comments (4)
The "ANSI" code page is basically legacy: Windows 9x era. All modern software should be Unicode (that is, UTF-16) based anyway.
Basically, when the ANSI code page machinery was originally designed, UTF-8 wasn't even invented, so support for multi-byte encodings was rather haphazard (most ANSI code pages are single-byte, with the exception of some East Asian code pages that use one or two bytes per character). Adding support for "proper" multi-byte encodings was probably deemed not worth the effort when all new development should be done in UTF-16 anyway.
_setmbcp() is a VC++ RTL function, not a Win32 API function. It only affects how the RTL interprets strings. It has no effect whatsoever on Win32 API A functions. When they call their W counterparts internally, the A functions always use MultiByteToWideChar() and WideCharToMultiByte() specifying code page 0 (CP_ACP) to use the system default ANSI code page for the conversions.
Michael Kaplan, an internationalization expert from Microsoft, tried to answer this on his blog.
Basically his explanation is that even though the "ANSI" versions of Windows API functions are meant to handle different code pages, historically there was an implicit expectation that character encodings would require at most two bytes per code point. UTF-8 doesn't meet that expectation, and changing all of those functions now would require a massive amount of testing.
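
To make the "at most two bytes" expectation concrete, here is a minimal sketch in portable C. The helper names (utf8_seq_len, utf8_count, dbcs_style_count) are hypothetical and not part of any Windows API, and the DBCS-style scanner treats every byte of 0x80 or above as a lead byte, which is a simplification of what a real code-page lookup would do. It only illustrates why code written for "one byte, or a lead byte plus one trail byte" miscounts UTF-8 text.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence starting with `lead`,
 * or 0 if `lead` cannot start a sequence (continuation or invalid byte). */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                   /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  /* two-byte sequence */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  /* three-byte sequence */
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* four-byte sequence */
    return 0;
}

/* Counts characters using lead-byte lengths only (continuation bytes are
 * not validated). Returns (size_t)-1 on a bad lead byte or truncation. */
size_t utf8_count(const unsigned char *s, size_t n)
{
    size_t count = 0, i = 0;
    while (i < n) {
        size_t len = utf8_seq_len(s[i]);
        if (len == 0 || len > n - i) return (size_t)-1;
        i += len;
        count++;
    }
    return count;
}

/* Hypothetical DBCS-style scanner: a character is one byte, or a lead byte
 * (assumed here: any byte >= 0x80) followed by exactly one trail byte. */
size_t dbcs_style_count(const unsigned char *s, size_t n)
{
    size_t count = 0, i = 0;
    while (i < n) {
        i += (s[i] >= 0x80 && i + 1 < n) ? 2 : 1;
        count++;
    }
    return count;
}
```

For the three-byte sequence E2 82 AC (the euro sign), utf8_count sees one character while dbcs_style_count sees two, splitting the character in the middle. That is the kind of mismatch every "ANSI" code path would have to be audited for.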
The reason is exactly what was said in jamesdlin's answer and the comments below it: in Windows, MBCS is the same as DBCS, and some functions don't work with characters longer than 2 bytes.
So UTF-8 was allowed in functions like read/write, but not as a locale.
However, Microsoft has finally fixed that, so now we can use UTF-8 as a locale. In fact, MS has even started recommending the ANSI APIs (-A) again instead of the Unicode (-W) versions. There are some new options in MSVC, /execution-charset:utf-8 and /utf-8, to set the charset, or you can set the ActiveCodePage property in the appxmanifest of a UWP app.
Since Windows 10 insider build 17035, before those options were introduced, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox had also been added for setting the locale code page to UTF-8.
To open that dialog box, open the Start menu, type "region" and select Region settings > Additional date, time & regional settings > Change date, time, or number formats > Administrative.
After enabling it you can call setlocale() to change to a UTF-8 locale. You can also use this in older Windows versions.
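
The setlocale() call mentioned above could look roughly like the sketch below. The ".UTF8" locale string is the form Microsoft documents for the Universal C Runtime. The other candidate names and the helper try_utf8_locale are assumptions added here so the code also does something sensible on non-Windows systems. They are not part of the original answer.

```c
#include <locale.h>
#include <stddef.h>

/* Try to switch the C runtime to a UTF-8 locale.
 * Returns the locale name reported by setlocale() on success,
 * or NULL if none of the candidates was accepted. */
const char *try_utf8_locale(void)
{
    /* ".UTF8" targets the Universal CRT on Windows; the remaining names
     * are assumed fallbacks for other platforms. */
    static const char *const candidates[] = { ".UTF8", "C.UTF-8", "en_US.UTF-8" };
    size_t i;

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        const char *name = setlocale(LC_ALL, candidates[i]);
        if (name != NULL)
            return name;  /* points to storage owned by the C runtime */
    }
    return NULL;          /* runtime refused every UTF-8 candidate */
}
```

Checking the return value matters: setlocale() returns NULL when the runtime does not accept the requested locale, and in that case the previous locale stays in effect.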