Unicode normalization in Windows

Posted on 2024-11-29 16:10:43

I've been using "Unicode strings" in Windows for as long as I've known about Unicode (that is, since after graduating). However, it has always mystified me that the Win32 API uses the term "Unicode" very loosely. In particular, the "Unicode" variant that MSDN refers to is UTF-16 (although the "wide char" terminology dates from the time when it was UCS-2, which is not Unicode). Yet MSDN makes almost no mention of Unicode normalization.

MSDN has a few pages about Unicode and Unicode normalization forms, and about the functions that change the normalization form. The page on normalization even says:

Win32 and the .NET Framework support all four normalization forms.

However, I haven't found anywhere in the docs what normalization form is used (or understood) by the Win32 API.

Question 1: what normalization form is used by default for user input (such as an Edit control) and conversion through MultiByteToWideChar()?

Question 2: must the strings passed to Win32API functions be in a particular normalization form, or are the kernel and file system normalization-agnostic?

Comments (3)

西瑶 2024-12-06 16:10:43

From the MSDN article Using Unicode Normalization to Represent Strings.

Windows, Microsoft applications, and the .NET Framework generally generate characters in form C using normal input methods. For most purposes on Windows, form C is the preferred form. For example, characters in form C are produced by Windows keyboard input. However, characters imported from the Web and other platforms can introduce other normalization forms into the data stream.
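To illustrate why imported text matters, here is a minimal sketch in Python (chosen only because it runs anywhere; on Windows the equivalent step would go through the Win32 normalization function `NormalizeString`). The helper name `to_form_c` and the sample strings are assumptions made for this example, not something taken from the MSDN article.

```python
import unicodedata


def to_form_c(text: str) -> str:
    """Return `text` in Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# The same visible character "é", spelled in two ways.
keyboard_input = "\u00e9"   # precomposed, as form C spells it
imported_text = "e\u0301"   # "e" + combining acute, as form D spells it

# A plain comparison sees two different strings.
same_before = keyboard_input == imported_text

# After normalizing both sides to form C they compare equal.
same_after = to_form_c(keyboard_input) == to_form_c(imported_text)

print(same_before, same_after)
```

Normalizing at the point where data enters the application keeps later comparisons simple, but it is a design choice for the application, not something Windows does on its behalf.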

Update: I've included some specific details relating to Question #2.

As for the file system, normalization is not required, according to the article Naming Files, Paths, and Namespaces:

There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs. Any normalization that your application requires should be performed with this in mind, external of any calls to related Windows file I/O API functions.

As for SQL Server, no normalization is required, and data is not normalized when it is saved in the database. That said, when comparing strings, SQL Server 2000 uses its own string normalization mechanism inside indexes, but I cannot find specific details on what that is. A SQL Server 2005 article states the same:

One important change in SQL Server 7.0 was the provision of an operating system–independent model for string comparison, so that the collations between all operating systems from Windows 95 through Windows 2000 would be consistent. This string comparison code was based on the same code that Windows 2000 uses for its own string normalization, and is encapsulated to be the same on all computers and in all versions of SQL Server.

负佳期 2024-12-06 16:10:43

what normalization form is used by default for user input

Depends on your keyboard layout/IME. It's possible to generate normal form C, D, or a crazy mixture of both if you want.

Keyboard layouts tend towards NFC because in the pre-Unicode days they'd've usually been outputting a single byte character in the local code page for each keypress. However there are exceptions.

For example, with the Windows Vietnamese keyboard layout, some diacritics are typed as a single keypress combined with the letter (e.g. circumflex, â) and some are typed as a combining diacritical (e.g. grave, U+0300). The grapheme a-with-circumflex-and-grave would be typed as a-circumflex followed by combining-grave, ầ, which would be 0xE2,0xCC in Vietnamese code page 1258, and would come out as U+00E2,U+0300 in Unicode.

This isn't in normal form C (which would be U+1EA7 Latin small letter A with circumflex and grave) nor D (which would be ầ U+0061,U+0302,U+0300).
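The code point sequences above can be checked with a short Python sketch. It decodes the two code page 1258 bytes and compares the result with its form C and form D spellings. The helper `code_points` is a name introduced for this example only, and Python's `cp1258` codec stands in for whatever conversion the keyboard layout or `MultiByteToWideChar()` would perform.

```python
import unicodedata


def code_points(text: str) -> str:
    """Format a string as a comma-separated list of U+XXXX code points."""
    return ",".join(f"U+{ord(ch):04X}" for ch in text)


# a-circumflex followed by combining grave, as bytes in code page 1258.
typed = bytes([0xE2, 0xCC]).decode("cp1258")

nfc = unicodedata.normalize("NFC", typed)   # fully composed
nfd = unicodedata.normalize("NFD", typed)   # fully decomposed

print("typed:", code_points(typed))
print("NFC:  ", code_points(nfc))
print("NFD:  ", code_points(nfd))

# The typed sequence matches neither normalized spelling.
is_mixed = typed != nfc and typed != nfd
print("mixed form:", is_mixed)
```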

There is generally a cultural preference for NFC in the Windows world and on the web, and for NFD in the Apple world. But it's not rigorously enforced and you should expect to cope with any mixture of combined and decomposed characters.

are the kernel and file system normalization-agnostic?

Yes, the kernel and filesystem don't know anything about normalisation and will quite happily allow you to have three visually identical files named ầ.txt in the same folder: one spelled U+1EA7, one spelled U+00E2,U+0300 and one spelled U+0061,U+0302,U+0300.
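Since the file system will not flag such look-alike names, an application that cares has to detect them itself. The following is a rough Python sketch, assuming the names are already in memory as strings. `find_normalization_collisions` is a hypothetical helper written for this example, and grouping by form C is just one possible choice of key.

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(names):
    """Group names that differ as code point sequences but share one NFC form."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    # Keep only groups in which more than one spelling maps to the same key.
    return [group for group in groups.values() if len(group) > 1]


# Three spellings of the same visible name, plus an unrelated file.
names = [
    "\u1ea7.txt",          # form C
    "\u00e2\u0300.txt",    # mixed: a-circumflex + combining grave
    "a\u0302\u0300.txt",   # form D
    "readme.txt",
]

collisions = find_normalization_collisions(names)
for group in collisions:
    print([name.encode("unicode_escape").decode("ascii") for name in group])
```

The sketch works on a list of strings and never touches the disk, so it makes no claim about how any particular file system stores the names.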

葵雨 2024-12-06 16:10:43

First of all, thanks for an excellent question. I found the answer in Michael Kaplan's blog:

But since all of the methods of text input on Windows tend to use the same normalization form already (form C), ...
