How do you convert a string to upper/lower case in Unicode?

Posted on 2024-07-09 23:58:13

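This is mostly a theoretical question that I'm curious about. (I'm not trying to do this by coding it myself, and I'm not reinventing the wheel.)

My question is how the uppercase/lowercase equivalence table works for Unicode.

For example, if I had to do this in ASCII, I'd take a character and, if it falls within the [a-z] range, apply the offset between A and a.

If it's not in that range, I'd build a small equivalence table for the 10 or so accented characters plus ñ. (Or I could have a full 256-entry equivalence array in which most entries are the same as the input.)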
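The approach described above can be sketched as follows. This is only an illustration of the idea in the question: the names `ASCII_OFFSET`, `EXTRA_UPPER` and `simple_upper` are made up for the example, and the table of accented characters is a small, arbitrary sample rather than a complete list.

```python
# Distance between a lowercase ASCII letter and its uppercase counterpart.
ASCII_OFFSET = ord("a") - ord("A")

# Hypothetical hand-built table for a few accented characters plus ñ.
EXTRA_UPPER = {"á": "Á", "é": "É", "í": "Í", "ó": "Ó", "ú": "Ú", "ñ": "Ñ"}


def simple_upper(text: str) -> str:
    """Uppercase using the ASCII offset, falling back to the small table."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            # Plain ASCII letter: shift by the fixed offset.
            out.append(chr(ord(ch) - ASCII_OFFSET))
        else:
            # Anything else: use the table, or leave the character unchanged.
            out.append(EXTRA_UPPER.get(ch, ch))
    return "".join(out)
```

Any character missing from the table (for example `ü`) passes through unchanged, which is exactly why this approach does not scale to the whole Unicode range.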

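However, I'm guessing there's a better way of specifying the equivalences in Unicode, since there are hundreds of thousands of characters and, in theory, a new language or character set can be added (and I'd expect that you don't need to patch Windows when that happens).

Does Windows have a huge hard-coded equivalence table for every character? Or how is this implemented?

A related question is how SQL Server implements Unicode-based accent-insensitive and case-insensitive queries. Does it have an internal table that tells it that é ë è E É È and Ë are all equivalent to "e"?

That doesn't sound very fast when it comes to comparing strings.

How does it access indexes quickly? Does it already store the indexed values converted to the "base" characters that correspond to that field's collation?

Does anyone know the internals of these things?

Thank you!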

挽梦忆笙歌 2024-07-16 23:58:13

I'm going to address the MS SQL Server part of this question, but the "correct" answer actually depends on the language(s) supported and application.

When you create a table in SQL Server, each text field has either an implicitly or explicitly specified collation. This affects both sort order and comparison behavior. The default, for most English (US) locales, is Latin1_General_CI_AS, or Latin 1, Case-insensitive, Accent-Sensitive. That means that, for example, a=A, but a!=Ä and a!=ä. You can also use accent-insensitive (Latin1_General_CI_AI) which treats all the diacritic variations of "A" as equal.
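To make the CI_AS versus CI_AI distinction concrete, here is a rough Python approximation using the standard `unicodedata` module. It is not how SQL Server implements collations; the helper names `strip_accents`, `equal_ci_as` and `equal_ci_ai` are invented for this sketch, and "accent" is assumed to mean "any Unicode combining mark".

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose the text and drop combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def equal_ci_as(a: str, b: str) -> bool:
    """Case-insensitive, accent-sensitive comparison (roughly CI_AS)."""
    key_a = unicodedata.normalize("NFC", a.casefold())
    key_b = unicodedata.normalize("NFC", b.casefold())
    return key_a == key_b


def equal_ci_ai(a: str, b: str) -> bool:
    """Case-insensitive, accent-insensitive comparison (roughly CI_AI)."""
    return strip_accents(a).casefold() == strip_accents(b).casefold()
```

With these helpers, `equal_ci_as("a", "A")` holds while `equal_ci_as("a", "ä")` does not, and `equal_ci_ai("a", "Ä")` holds.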

Some locales support other categories of comparison; for example, French orders words containing diacritics somewhat differently than German does. Turkish considers a dotless i and dotted i semantically different, so I and i don't match even with case-insensitive comparisons if you use Turkish, case-insensitive, accent-sensitive collation.
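The Turkish case is easy to see in code. Python's built-in `str.lower()` and `str.upper()` are locale-independent, so a Turkish-aware mapping has to be layered on top. The two helpers below are hypothetical and cover only the four i-like letters; a real application would normally rely on a library such as ICU instead.

```python
DOTLESS_I = "\u0131"      # ı  LATIN SMALL LETTER DOTLESS I
DOTTED_CAP_I = "\u0130"   # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish rules: İ -> i and I -> ı (illustrative only)."""
    return text.replace(DOTTED_CAP_I, "i").replace("I", DOTLESS_I).lower()


def turkish_upper(text: str) -> str:
    """Uppercase with Turkish rules: i -> İ and ı -> I (illustrative only)."""
    return text.replace("i", DOTTED_CAP_I).replace(DOTLESS_I, "I").upper()


def turkish_equal_ci(a: str, b: str) -> bool:
    """Case-insensitive equality under the Turkish mapping above."""
    return turkish_lower(a) == turkish_lower(b)
```

Under the default rules `"I".lower() == "i"`, but `turkish_equal_ci("I", "i")` is false because `I` lowercases to the dotless `ı`.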

You can change the collation per database, per table, per field, and, with some cost, even per-query. My understanding is that indices normalize according to the specified collation order, which means that basically the index keeps a flattened version of the original string. For example, with case-insensitive collations, Apple and apple are stored as apple. Queries are flattened with the same collation before the search.
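As a toy model of that idea (and only that: it says nothing about how SQL Server actually lays out its B-tree indexes), an index can store a flattened key alongside the row identifiers and flatten the search term the same way before looking it up. `collation_key` and `CaseInsensitiveIndex` are names made up for this sketch, and case folding stands in for a real collation.

```python
from collections import defaultdict


def collation_key(value: str) -> str:
    """Flatten a string for a case-insensitive collation (assumed: casefold)."""
    return value.casefold()


class CaseInsensitiveIndex:
    """Maps flattened keys to the ids of the rows that contain them."""

    def __init__(self) -> None:
        self._rows = defaultdict(list)

    def add(self, value: str, row_id: int) -> None:
        # Store the row under the flattened key, not the original string.
        self._rows[collation_key(value)].append(row_id)

    def lookup(self, value: str) -> list:
        # Flatten the query with the same rule before searching.
        return list(self._rows.get(collation_key(value), []))
```

Adding `"Apple"` as row 1 and `"apple"` as row 2 puts both under the key `apple`, so a lookup for `"APPLE"` finds both rows with a single exact-match probe.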

In Japanese, there's another category of normalization, where fullwidth and halfwidth characters are treated as equal (ア=ｱ) and, in some cases, two halfwidth characters are flattened to a single, semantically equivalent character (バ=ﾊﾞ). Finally, for some languages, there's another ball of wax with composite characters, where isolated diacritic characters can be composed with other characters (e.g. ä can be a single precomposed character, or the simple form a followed by a separate combining umlaut). Vietnamese, Thai and a few other languages have variations of this category. If there's a canonical form, Unicode normalization allows the composed and decomposed forms to be treated as equivalent. Unicode normalization is typically applied before any comparisons are made.
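These equivalences can be checked with Python's standard `unicodedata.normalize`. The helper `same_after_normalizing` is a name invented for this sketch; the code points are written as escapes so the example does not depend on how the page renders them.

```python
import unicodedata


def same_after_normalizing(a: str, b: str, form: str = "NFKC") -> bool:
    """Compare two strings after applying the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


FULLWIDTH_A = "\u30a2"          # ア  KATAKANA LETTER A
HALFWIDTH_A = "\uff71"          # ｱ  HALFWIDTH KATAKANA LETTER A
FULLWIDTH_BA = "\u30d0"         # バ  KATAKANA LETTER BA
HALFWIDTH_BA = "\uff8a\uff9e"   # ﾊﾞ  halfwidth HA + halfwidth voiced sound mark
COMPOSED_A_UMLAUT = "\u00e4"    # ä as one precomposed character
DECOMPOSED_A_UMLAUT = "a\u0308" # a followed by COMBINING DIAERESIS

# Width differences are compatibility equivalences, so they need NFKC or NFKD.
width_equal = same_after_normalizing(FULLWIDTH_A, HALFWIDTH_A, "NFKC")
voiced_equal = same_after_normalizing(FULLWIDTH_BA, HALFWIDTH_BA, "NFKC")
# Composed versus decomposed accents are canonical equivalences, so NFC suffices.
umlaut_equal = same_after_normalizing(COMPOSED_A_UMLAUT, DECOMPOSED_A_UMLAUT, "NFC")
```

Without normalization, each of these pairs compares unequal as raw code point sequences.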

To summarize, for a case-insensitive comparison, you do something much like you would when comparing ASCII-range strings: flatten the left and right sides of the comparison "to lower case" (for example), then compare the results as binary arrays. The difference is that you need to
1) normalize the strings to the same Unicode normalization form (NFKC or NFKD)
2) normalize the strings to the same case according to the rules of that locale
3) normalize the accents according to the accent-sensitivity rules
4) compare the results with a binary comparison
5) if applicable, such as in the case of sorting, compare using additional secondary and tertiary sorting rules, which include things like "Mc" sorting before "M" in some languages.
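The steps above can be sketched as a multi-level sort key. This is a simplified illustration, not a real collation: actual implementations use per-locale weight tables (as in the Unicode Collation Algorithm), whereas this sketch just compares code points. The names `primary_key` and `sort_key` are made up for the example.

```python
import unicodedata


def primary_key(text: str) -> str:
    """Base letters only: normalized, accents removed, case folded."""
    decomposed = unicodedata.normalize("NFKD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def sort_key(text: str) -> tuple:
    """Three levels: base letters, then accents, then case."""
    decomposed = unicodedata.normalize("NFKD", text)
    return (
        primary_key(text),      # primary: ignores accents and case
        decomposed.casefold(),  # secondary: accents break ties
        decomposed,             # tertiary: case breaks remaining ties
    )


words = ["b", "Apple", "\u00e4pple", "apple", "a"]
ordered = sorted(words, key=sort_key)
```

Comparing only `primary_key` values gives a case-insensitive, accent-insensitive equality test, while sorting by the full tuple keeps `Apple`, `apple` and `äpple` next to each other instead of scattering them by raw code point.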

And yes, Windows stores tables for all of these rules. You don't get all of them by default in every installation, unless you add support for them with the East Asian Language Support and Complex Scripts support from control panel.
