How do you convert strings to uppercase/lowercase in Unicode?
This is mostly a theoretical question I'm just very curious about. (I'm not trying to do this by coding it myself or anything, I'm not reinventing wheels.)
My question is how the uppercase/lowercase table of equivalence works for Unicode.
For example, if I had to do this in ASCII, I'd take a character, and if it falls within the [a-z] range, I'd apply the difference between A and a.
If it doesn't fall on that range, I'd have a small equivalence table for the 10 or so accented characters plus ñ.
(Or, I could just have a full equivalence array with 256 entries, most of which would be the same as the input)
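As a rough sketch of the two ASCII-era approaches described above (the function names are made up for illustration, and Latin-1 is assumed as the single-byte encoding for the 256-entry table):

```python
def ascii_upper(text: str) -> str:
    """Upper-case only a-z by arithmetic on code points; leave the rest alone."""
    offset = ord("a") - ord("A")  # 32 in ASCII
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr(ord(ch) - offset))
        else:
            out.append(ch)
    return "".join(out)


def _latin1_upper_entry(i: int) -> int:
    """Table entry for byte i: its uppercase if that is a single Latin-1 byte."""
    up = chr(i).upper()
    if len(up) == 1 and ord(up) < 256:
        return ord(up)
    return i  # e.g. 'ß' (upper is "SS") and 'ÿ' (upper is outside Latin-1)


# Full 256-entry equivalence table; most entries map to themselves.
LATIN1_UPPER = bytes(_latin1_upper_entry(i) for i in range(256))


def latin1_upper(data: bytes) -> bytes:
    """Upper-case a Latin-1 byte string with one table lookup per byte."""
    return data.translate(LATIN1_UPPER)
```

The table variant already shows why this stops scaling: some entries have no single-byte answer, so they have to map to themselves.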
However, I'm guessing that there's a better way of specifying the equivalences in Unicode, given that there are hundreds of thousands of characters, and that theoretically, a new language or set of characters can be added (and I'm expecting that you wouldn't need to patch Windows when that happens).
Does Windows have a huge hard-coded equivalence table for each character? Or how is this implemented?
A related question is how SQL Server implements Unicode-based accent-insensitive and case-insensitive queries. Does it have an internal table that tells it that é ë è E É È and Ë are all equivalent to "e"?
That doesn't sound very fast when it comes to comparing strings.
How does it access Indexes quickly? Does it already index values converted to their "base" characters, corresponding to that field's collation?
Does anyone know the internals for these things?
Thank you!
Comments (4)
I'm going to address the MS SQL Server part of this question, but the "correct" answer actually depends on the language(s) supported and application.
When you create a table in SQL Server, each text field has either an implicitly or explicitly specified collation. This affects both sort order and comparison behavior. The default, for most English (US) locales, is Latin1_General_CI_AS, or Latin 1, Case-insensitive, Accent-Sensitive. That means that, for example, a=A, but a!=Ä and a!=ä. You can also use accent-insensitive (Latin1_General_CI_AI) which treats all the diacritic variations of "A" as equal.
Some locales support other categories of comparison; for example, French orders words containing diacritics somewhat differently than German does. Turkish considers a dotless i and dotted i semantically different, so I and i don't match even with case-insensitive comparisons if you use Turkish, case-insensitive, accent-sensitive collation.
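To make the Turkish case concrete, here is a small Python sketch. Python's built-in `str.lower()` and `str.upper()` are not locale-aware, so `lower_turkish` and `upper_turkish` below are hypothetical helpers written only for this illustration, not part of any library:

```python
DOTLESS_I = "\u0131"    # ı  LATIN SMALL LETTER DOTLESS I
DOTTED_CAP_I = "\u0130"  # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE


def lower_turkish(text: str) -> str:
    """Lower-case with Turkish pairing: I -> ı and İ -> i."""
    text = text.replace("I", DOTLESS_I).replace(DOTTED_CAP_I, "i")
    return text.lower()


def upper_turkish(text: str) -> str:
    """Upper-case with Turkish pairing: i -> İ and ı -> I."""
    text = text.replace("i", DOTTED_CAP_I).replace(DOTLESS_I, "I")
    return text.upper()


# Default (locale-independent) folding treats I and i as a case pair...
default_match = "I".lower() == "i".lower()

# ...while Turkish-style folding keeps them apart.
turkish_match = lower_turkish("I") == lower_turkish("i")
```

Under these helpers `default_match` is true and `turkish_match` is false, which mirrors the behavior described for a Turkish case-insensitive collation.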
You can change the collation per database, per table, per field, and, with some cost, even per-query. My understanding is that indices normalize according to the specified collation order, which means that basically the index keeps a flattened version of the original string. For example, with case-insensitive collations, Apple and apple are stored as apple. Queries are flattened with the same collation before the search.
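The idea of an index over flattened keys can be sketched in a few lines of Python. This is only an illustration of the concept as described above, not how SQL Server actually stores its indexes; `flatten` and `CaseInsensitiveIndex` are names invented for the example, and `str.casefold()` stands in for a real collation.

```python
import bisect


def flatten(value: str) -> str:
    """Stand-in for 'apply the collation': here, plain case folding."""
    return value.casefold()


class CaseInsensitiveIndex:
    """Sorted list of (flattened key, original value) pairs."""

    def __init__(self, values):
        self._entries = sorted((flatten(v), v) for v in values)
        self._keys = [key for key, _ in self._entries]

    def find(self, query: str):
        """Flatten the query the same way, then binary-search the keys."""
        key = flatten(query)
        lo = bisect.bisect_left(self._keys, key)
        hi = bisect.bisect_right(self._keys, key)
        return [original for _, original in self._entries[lo:hi]]


index = CaseInsensitiveIndex(["Apple", "banana", "apple", "Cherry"])
matches = index.find("APPLE")  # both "Apple" and "apple"
```

Because the flattening happens once at insert time and once per query, each lookup is a binary search over plain keys rather than a per-character table lookup on every comparison.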
In Japanese, there's another category of normalization, where fullwidth and halfwidth characters are treated as equivalent (like ｱ=ア), and in some cases, two halfwidth characters are flattened to a single, semantically equivalent character (ﾊﾞ=バ). Finally, for some languages, there's another ball of wax with composite characters, where isolated diacritic characters can be composed with other characters (e.g. the umlaut in ä is one character, composed with the simple form a). Vietnamese, Thai and a few other languages have variations of this category. If there's a canonical form, Unicode normalization allows the composed and decomposed forms to be treated as equivalent. Unicode normalization is typically applied before any comparisons are made.
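These equivalences can be checked with Python's standard `unicodedata` module. The sketch below uses the compatibility form NFKC for the width variants and the canonical forms NFC/NFD for the umlaut; the variable names are just for illustration.

```python
import unicodedata

# Width variants: compatibility normalization maps halfwidth to fullwidth.
half_a = "\uff71"         # HALFWIDTH KATAKANA LETTER A
full_a = "\u30a2"         # KATAKANA LETTER A
width_equal = unicodedata.normalize("NFKC", half_a) == full_a

# Two halfwidth characters (HA + voiced sound mark) become one character BA.
half_ba = "\uff8a\uff9e"
full_ba = "\u30d0"        # KATAKANA LETTER BA
ba_equal = unicodedata.normalize("NFKC", half_ba) == full_ba

# Composed vs. decomposed: ä as one code point, or a + COMBINING DIAERESIS.
composed = "\u00e4"
decomposed = "a\u0308"
raw_equal = composed == decomposed  # False: different code point sequences
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed
```

Only `raw_equal` is false; every comparison made after normalizing to a common form succeeds.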
To summarize, for a case-insensitive comparison, you do something much like you would when comparing ASCII-range strings: flatten the left and right side of the comparison "to lower case" (for example), then compare the array as a binary array. The difference is that you need to
1) normalize the strings to the same Unicode form (NFKC or NFKD)
2) normalize the strings to the same case according to the rules of that locale
3) normalize the accents according to the accent-sensitivity rules
4) compare according to a binary comparison
5) if applicable, such as in the case of sorting, compare using additional secondary and tertiary sorting rules, which include rules like "Mc" sorting before "M" in some languages.
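The normalize, fold, strip-accents, and binary-compare steps above can be sketched in Python with the standard `unicodedata` module. This is a simplified, locale-independent approximation (`str.casefold()` knows nothing about Turkish, for example), `comparison_key` and `equal_under` are names made up for this example, and the locale-specific sorting rules of the last step are not covered.

```python
import unicodedata


def comparison_key(text: str, accent_sensitive: bool = True) -> str:
    """Build a flattened key so that equal keys mean 'equal under the rules'."""
    # Normalize to one Unicode form (decomposed, compatibility).
    key = unicodedata.normalize("NFKD", text)
    # Normalize case; re-normalize because folding can change the form.
    key = unicodedata.normalize("NFKD", key.casefold())
    # Optionally drop combining marks (the accents split off by NFKD).
    if not accent_sensitive:
        key = "".join(ch for ch in key if not unicodedata.combining(ch))
    return key


def equal_under(a: str, b: str, accent_sensitive: bool = True) -> bool:
    """Plain comparison of the two flattened keys."""
    return comparison_key(a, accent_sensitive) == comparison_key(b, accent_sensitive)


ci_as = equal_under("\u00c9clair", "\u00e9clair")                     # case differs only
ci_as_accent = equal_under("\u00e9clair", "eclair")                   # accent differs
ci_ai = equal_under("\u00c9CLAIR", "eclair", accent_sensitive=False)  # both differ
```

Here `ci_as` and `ci_ai` are true while `ci_as_accent` is false, matching the CI_AS versus CI_AI distinction described earlier.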
And yes, Windows stores tables for all of these rules. You don't get all of them by default in every installation, unless you add support for them with the East Asian Language Support and Complex Scripts support from Control Panel.
There is a mapping file that contains all the case mappings that have a 1:1 mapping ratio. Usually operating systems/frameworks/libraries support a specific version of Unicode, and since this case mappings file is versioned, you would get the mappings for whichever version of Unicode your particular OS/framework/library/whatever happened to support.
For more information on Unicode case mappings, see: http://www.unicode.org/faq/casemap_charprop.html
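As an illustration, and assuming the file meant here is `UnicodeData.txt` from the Unicode Character Database (the answer doesn't name it), each line is a semicolon-separated record whose fields 12, 13 and 14 hold the simple uppercase, lowercase and titlecase mappings. The sketch below parses two sample lines embedded in the code instead of downloading the file; `parse_simple_case` is a made-up helper name.

```python
import unicodedata

# Two records in UnicodeData.txt format, embedded here for the example.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""


def parse_simple_case(lines):
    """Return {char: (upper, lower, title)} using None for empty fields."""
    table = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 15:
            continue  # skip blank or malformed lines
        mappings = tuple(
            chr(int(f, 16)) if f else None for f in fields[12:15]
        )
        table[chr(int(fields[0], 16))] = mappings
    return table


case_table = parse_simple_case(SAMPLE.splitlines())

# The Unicode version whose data ships with this Python build.
supported_version = unicodedata.unidata_version
```

`case_table["a"]` comes out as `("A", None, "A")`, and `supported_version` shows the versioning point: the mappings you get are whatever that release of the data defines.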
Most writing systems do not have separate uppercase and lowercase letters. According to Wikipedia, exceptions include "Roman, Greek, Cyrillic and Armenian alphabets".
So there aren't that many letters to worry about. This page shows that large ranges of characters follow a simple scheme of adding 1 to an uppercase character to get the lowercase equivalent (though of course there are some exceptions).
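A quick way to see this pattern is to ask Python (whose `str.lower()` uses the Unicode data bundled with the interpreter) for the offset between each uppercase letter and its lowercase form. The ranges below are picked by hand for the example, and `lowercase_offsets` is a hypothetical helper.

```python
def lowercase_offsets(code_points):
    """Set of (lowercase code point - uppercase code point) over a range."""
    offsets = set()
    for cp in code_points:
        lower = chr(cp).lower()
        if len(lower) == 1 and lower != chr(cp):
            offsets.add(ord(lower) - cp)
    return offsets


# Latin Extended-A, U+0100..U+012F: upper/lower alternate, so the offset is 1.
latin_ext_a = lowercase_offsets(range(0x0100, 0x0130, 2))

# Basic Cyrillic capitals, U+0410..U+042F: a fixed offset of 32, as in ASCII.
cyrillic = lowercase_offsets(range(0x0410, 0x0430))

# One of the exceptions: U+0130 (İ) lowercases to two code points.
exception_length = len("\u0130".lower())
```

The first set contains only 1 and the second only 32, while the exception yields a two-code-point result, which is why a plain offset rule is not enough on its own.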
The correct answer is a little more complicated, depending on what you are trying to do.
When comparing character strings, for sorting or searching applications, the correct algorithm to use is specified in UTS #10: "Unicode Collation Algorithm". Case-insensitivity is part of the mix, but there are different ways to represent many characters, and applications often need to treat the various representations as equivalent.
The sorting rules are locale-dependent. This is mainly an issue when you are sorting results for display to a user. Ignoring the rules can frustrate users and even result in security vulnerabilities.
If you are just trying to capitalize words for display purposes, the rules there can be tricky too; there are one-to-many conversions and other issues. Depending on the locale, the same letter may capitalize differently. The letter's position in a word can make a difference. There's also a distinct notion of "title case", where you just want to capitalize the first letter of each word. Sometimes the title-case of a character is not the same as its upper-case.
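A few of these cases can be seen directly in Python, whose string methods implement the locale-independent parts of the Unicode case rules (locale-specific behavior, such as Turkish, is not covered by them). The variable names are only for illustration.

```python
# One-to-many: a single character upper-cases to two.
sharp_s_upper = "\u00df".upper()   # ß -> "SS"
ligature_upper = "\ufb01".upper()  # ﬁ (fi ligature) -> "FI"

# Position matters: Greek capital sigma lowercases to the final form ς at the
# end of a word and to σ elsewhere.
odos_lower = "\u039f\u0394\u039f\u03a3".lower()  # ΟΔΟΣ
ends_with_final_sigma = odos_lower.endswith("\u03c2")
sigma_alone = "\u03a3".lower()  # a lone Σ -> σ (U+03C3)

# Title case differs from upper case for the digraph ǆ (U+01C6).
dz_upper = "\u01c6".upper()  # Ǆ (U+01C4), both letters capitalized
dz_title = "\u01c6".title()  # ǅ (U+01C5), only the first capitalized
```

Round-tripping is therefore lossy: `"\u00df".upper().lower()` gives `"ss"`, not the original character.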