SQL Server 排序规则选择

发布于 2024-11-10 10:16:45 字数 628 浏览 3 评论 0原文

今晚我花了很多时间试图找到有关在 SQL Server 2008 R2 安装中应用哪种排序规则的指南,但几乎所有在线内容基本上都在说“选择适合您的排序规则”。 极其没有帮助。

我的背景是新应用程序开发。我不担心与先前版本的 SQL Server(即 <= 2005)的向后兼容性。我对存储代表全球语言的数据非常感兴趣 - 不仅仅是拉丁语。我在网上找到的很少的帮助表明我应该避免所有“SQL_”排序规则。这将我的选择范围缩小到使用基于 Windows 区域设置的二进制或“非二进制”排序规则。

如果我使用二进制,我认为我应该使用“BIN2”。这就是我的问题。如何确定是否应该使用 BIN2 还是仅使用“Latin1_General_100_XX_XX_XX”?我的蜘蛛感知告诉我,BIN2 将提供“不太准确”的排序规则,但对所有语言都更通用(而且速度更快!)。我还怀疑二进制排序规则区分大小写、区分重音和区分假名(是吗?)。相反,我怀疑非二进制排序规则最适合基于拉丁语的语言。

该文档不支持我的上述主张,我正在做出有根据的猜测。但这就是问题所在!为什么在线文档如此薄弱,以至于只能靠猜测来选择?甚至“SQL Server 2008 Internals”一书也讨论了各种选择,但没有解释为什么以及何时选择二进制排序规则(与非二进制 Windows 排序规则相比)。犯罪!!!

I've spent a lot of time this evening trying to find guidance about which choice of collation to apply in my SQL Server 2008 R2 installation, but almost everything online basically says "choose what is right for you." Extremely unhelpful.

My context is new application development. I am not worrying about backward compatibility with a prior version of SQL Server (viz. <= 2005). I am very interested in storing data representing languages from around the globe - not just Latin based. What very little help I've found online suggests I should avoid all "SQL_" collations. This narrows my choice to using either a binary or "not binary" collation based on the Windows locale.

If I use binary, I gather I should use "BIN2." So this is my question. How do I determine whether I should use BIN2 or just "Latin1_General_100_XX_XX_XX"? My spider-sense tells me that BIN2 will provide collation that is "less accurate," but more generic for all languages (and fast!). I also suspect the binary collation is case sensitive, accent sensitive, and kana-sensitive (yes?). In contrast, I suspect the non-binary collation would work best for Latin-based languages.

The documentation doesn't support my claims above, I'm making educated guesses. But this is the problem! Why is the online documentation so thin that the choice is left to guesswork? Even the book "SQL Server 2008 Internals" discussed the variety of choices, without explaining why and when binary collation would be chosen (compared with non-binary windows collation). Criminy!!!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

べ繥欢鉨o。 2024-11-17 10:16:45

恕我直言,“SQL Server 2008 Internals”对这个主题进行了很好的讨论。

二进制排序规则很棘手,如果你打算支持人类的文本搜索,你最好选择非二进制排序规则。如果您已经调整了其他所有内容(首先是架构),并且在区分大小写和区分重音是所需行为的情况下(例如密码哈希),则二进制文件可以很好地获得一点性能。二进制排序规则实际上“更精确”,因为它不考虑相似的文本。不过,您从中得到的排序顺序仅适用于机器。

SQL_* 排序规则和本机 Windows 排序规则之间只有细微的差别。如果您不受兼容性的限制,请选择本机的,因为据我所知,它们是前进的方向。

排序规则决定排序顺序和相等性。您可以选择真正最适合您的用户的内容。据了解,您将为数据使用 unicode 类型(如 nvarchar)以支持国际文本。排序规则会影响非 unicode 列中可以存储的内容,但这不会影响您。

真正重要的是避免在 WHERE 子句中混合排序规则,因为这就是不使用索引而付出罚款的地方。据我所知,没有支持所有语言的银弹排序规则。您可以为大多数用户选择一个,也可以为每种语言提供不同的本地化支持栏。

一件重要的事情是让服务器排序规则与数据库排序规则相同。如果您计划使用临时表作为临时表(如果使用“CREATE TABLE #ttt...”创建)选择服务器排序规则,那么您的生活会变得更加轻松,并且您会遇到需要解决的排序规则冲突指定显式排序规则。这也会对性能产生影响。

"SQL Server 2008 Internals" has a good discussion on the topic imho.

Binary collation is tricky, if you intend to support text search for human beings, you'd better go with non-binary. Binary is good to gain a tiny bit of performance if you have tuned everything else (architecture first) and in cases where case sensitivity and accent sensitivity are a desired behavior, like password hashes for instance. Binary collation is actually "more precise" in a sense that it does not consider similar texts. The sort orders you get out of there are good for machines only though.

There is only a slight difference between the SQL_* collations and the native windows ones. If you're not constrained with compatibility, go for the native ones as they are the way forward afaik.

Collation decides sort order and equality. You choose, what really best suits your users. It's understood that you will use the unicode types (like nvarchar) for your data to support international text. Collation affects what can be stored in a non-unicode column, which does not affect you then.

What really matters is that you avoid mixing collations in WHERE clause because that's where you pay the fine by not using indexes. Afaik there's no silver bullet collation to support all languages. You can either choose one for the majority of your users or go into localization support with different column for each language.

One important thing is to have the server collation the same as your database collation. It will make your life much easier if you plan to use temporary tables as temporary tables if created with "CREATE TABLE #ttt..." pick up the server collation and you'd run into collation conflicts which you'll need to solve with specifying an explicit collation. This has a performance impact too.

青巷忧颜 2024-11-17 10:16:45

请不要认为我的答案是完整的,但您应该考虑以下几点:(

  • 正如 #Anthony 所说)所有文本字段都必须使用 nvarchar 数据类型。这将允许您存储任何语言的任何字符,如 UTF-8\unicode 字符集所定义!如果不这样做,您将无法在表格中混合不同来源的文本(拉丁语、西里尔语、阿拉伯语等)。

也就是说,您的排序规则选择将主要影响以下方面:

  • 在“e”和“é”或“c”和“ç”等字符之间设置的排序顺序或排序规则(它们是否应该被视为相等或不是?)。在某些情况下,整理序列确实会考虑特定的字母组合,就像在匈牙利语中一样,其中 C 和 CS,或 D、DZ 和 DZS 被独立考虑。
  • 分析空格(或其他非字母字符)的方式:哪一个是正确的“字母”顺序?

这个(空格被视为“第一等级”字符)?

San Juan
San Teodoro
Santa Barbara

或者这个(排序时不考虑空格)?

San Juan
Santa Barbara
San Teodoro
  • 排序规则还会影响大小写敏感性:大写字母是否必须被视为与小写字母相似?

Please do not consider my answer as complete, but you should take into consideration the following points:

  • ( as said by #Anthony) All text fields must use nvarchar data type. This will allow you to store any character from any language, as defined by UTF-8\unicode character set! If you do not do so, you will not be able to mix text from different origins (latin, cyrillic, arabic, etc) in your tables.

This said, your collation choice will mainly affect the following:

  • The collating sequence, or sorting rules to be set between characters such as 'e' and 'é', or 'c' and 'ç' (should they be considered as equal or not?). In some cases, collating sequences do consider specific letter combinations, just like in hungarian, where C and CS, or D, DZ and DZS, are considered independantly.
  • The way spaces (or other non letter characters) are analysed: which one is the correct 'alphabetical' order?

this one (spaces are considered as 'first rank' characters)?

San Juan
San Teodoro
Santa Barbara

or this one (spaces are not considered in the ordering)?

San Juan
Santa Barbara
San Teodoro
  • Collation also impacts on case sensitivity: do capital letters have to be considered as similar to small letters?
葬﹪忆之殇 2024-11-17 10:16:45

全球数据库(例如网站)的最佳默认排序规则可能是 Latin1_General_CI_AS。比排序规则更重要的是确保所有文本列都使用 nvarchar 数据类型。

The best default collation for a global database (e.g. a website) is probably Latin1_General_CI_AS. More important than collation is making sure that all textual columns use the nvarchar data type.

青瓷清茶倾城歌 2024-11-17 10:16:45

只要您使用 NVARCHAR 列(对于混合国际数据应该如此),所有 *_BIN 和 *_BIN2 排序规则都会基于 Unicode 代码点执行相同的二进制比较/排序。您选择哪一个并不重要。 Latin1_General_BIN2 看起来是一个合理的通用选择。

来源: http://msdn.microsoft.com/ en-us/library/ms143350(v=sql.105).aspx

As long as you use NVARCHAR columns (as you should for mixed international data), all *_BIN and *_BIN2 collations perform the same binary comparison/sorting based on the Unicode code points. It doesn't matter which one you pick. Latin1_General_BIN2 looks like a reasonable generic choice.

Source: http://msdn.microsoft.com/en-us/library/ms143350(v=sql.105).aspx

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文