UCS-2 and SQL Server

Posted on 2024-12-28 20:16:26

While researching options for storing mostly-English-but-sometimes-not data in a SQL Server database that can potentially be quite large, I'm leaning toward storing most string data as UTF-8 encoded.

However, Microsoft chose UCS-2 for reasons that I don't fully understand, which is causing me to second-guess that leaning. The documentation for SQL Server 2012 does show how to create a UTF-8 UDT, but the decision for UCS-2 presumably pervades SQL Server.

Wikipedia (which interestingly notes that UCS-2 is obsolete in favor of UTF-16) notes that UTF-8 is a variable-width character set capable of encoding any Unicode data point and that it provides the de facto standard encoding for interchange of Unicode text. So, it feels like any Unicode character can be represented in UTF-8, and since most text will be English, the representation will be nearly twice as compact as with UCS-2 (I know disk is "cheap", but disk cache isn't, and memory isn't in comparison to the data sizes I'm dealing with. Many operations degrade exponentially when the working set is larger than available RAM).

What problems might I encounter by swimming up the UCS-2 stream?

Comments (2)

一片旧的回忆 2025-01-04 20:16:26

storing mostly-English-but-sometimes-not data in a SQL Server database that can potentially be quite large, I'm leaning toward storing most string data as UTF-8 encoded.

Unlike some other RDBMSs that allow you to choose an encoding, SQL Server stores Unicode data only in UTF-16 (Little Endian), and non-Unicode data in an 8-bit encoding (Extended ASCII, DBCS, or EBCDIC) for whatever Code Page is implied by the Collation of the field.

Microsoft chose UCS-2 for reasons that I don't fully understand

Their decision to choose UCS-2 makes sense enough given that UTF-16 was introduced in mid-1996 and fully specified in 2000. A lot of other systems use (or used) it as well (please see: https://en.wikipedia.org/wiki/UTF-16#Usage). Their decision to continue with it might be more questionable, though it is probably due to Windows and .NET being UTF-16. The physical layout of the bytes is the same between UCS-2 and UTF-16, so upgrading systems from UCS-2 to support UTF-16 should be purely functional with no need to alter any existing data.

The documentation for SQL Server 2012 does show how to create a UTF-8 UDT,

Um, no. Creating a custom User-Defined Type via SQLCLR is not, in any way, going to get you a replacement for any native type. It is very handy for creating something to handle specialized data. But strings, even of a different encoding, are far from specialized. Going this route for your string data would destroy the usability of your system, not to mention performance, as you wouldn't be able to use any built-in string functions. If you were able to save anything on disk space, those gains would be erased by what you would lose in overall performance. Storing a UDT is done by serializing it to a VARBINARY. So in order to do any string comparison or sorting, outside of a "binary" / "ordinal" comparison, you would have to convert all other values, one by one, back to UTF-8 to then do the string comparison that can account for linguistic differences.
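As a rough illustration of that last point, the Python sketch below treats each stored value as opaque UTF-8 bytes, the way a VARBINARY-serialized UDT would appear to the engine. Sorting the raw bytes gives ordinal order; getting anything closer to linguistic order requires decoding every value first. The `casefold()` key is only a crude stand-in for a real collation (it is not what SQL Server does), and the sample words are made up.

```python
# Values as they would sit in storage: opaque UTF-8 byte strings.
stored = [w.encode("utf-8") for w in ["zebra", "Apple", "apple", "Zebra"]]

# Ordinal ("binary") order: compares raw bytes, so all uppercase sorts first.
binary_order = sorted(stored)


def linguistic_key(raw: bytes):
    """Decode one stored value, then build a case-insensitive sort key."""
    text = raw.decode("utf-8")      # per-value conversion cost
    return (text.casefold(), text)  # crude stand-in for a collation


linguistic_order = sorted(stored, key=linguistic_key)

print([b.decode("utf-8") for b in binary_order])
print([b.decode("utf-8") for b in linguistic_order])
```

The decode step in `linguistic_key` runs once per value on every sort, which is the per-row overhead that a native string type avoids.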

Also, that "documentation" is really just sample code / proof of concept stuff. The code was written in 2003 ( http://msftengprodsamples.codeplex.com/SourceControl/latest#Kilimanjaro_Trunk/Programmability/CLR/UTF8String/CS/UTF8String/Utf8String.cs ) for SQL Server 2005. I saw a script to test functionality, but nothing involving performance.

but the decision for UCS-2 presumably pervades SQL Server.

Yes, very much so. By default, the handling of the built-in functions is only for UCS-2. But starting in SQL Server 2012, you can get them to handle the full UTF-16 character set (well, as of Unicode Version 5 or 6, depending on your OS and version of the .NET Framework) by using one of the collations that has a name ending in _SC (i.e. Supplementary Characters).
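To make the difference concrete, this Python sketch (an analogy, not SQL Server code) counts the same string two ways: in UTF-16 code units, which is roughly how a UCS-2-only function sees it, and in code points, which is roughly what a supplementary-character-aware function sees. The function names are hypothetical.

```python
def count_code_units(text: str) -> int:
    """Length in 2-byte UTF-16 code units (the UCS-2 view)."""
    return len(text.encode("utf-16-le")) // 2


def count_code_points(text: str) -> int:
    """Length in Unicode code points (the supplementary-aware view)."""
    return len(text)


sample = "a\U0001F600b"  # 'a', one supplementary character, 'b'
print(count_code_units(sample), count_code_points(sample))
```

For text that contains no supplementary characters the two counts are identical, which is why the distinction goes unnoticed in most data.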

Wikipedia ... notes that UCS-2 is obsolete in favor of UTF-16

Correct. UTF-16 and UCS-2 both use 2-byte code units. But UTF-16 uses some of them in pairs (i.e. Surrogate Pairs) to map additional characters. The code points used for these pairs are reserved for this purpose in UCS-2 and hence are not used to map to any usable symbols. This is why you can store any Unicode character in SQL Server and it will be stored and retrieved correctly.
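The pairing can be sketched in a few lines of Python. The helper name `to_surrogate_pair` is made up for illustration; the arithmetic follows the UTF-16 definition, and the result is checked against Python's own encoder.

```python
import struct


def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000    # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


cp = 0x1F600
pair = to_surrogate_pair(cp)
# The same two 16-bit units that Python's UTF-16 LE encoder produces.
encoded = struct.unpack("<2H", chr(cp).encode("utf-16-le"))
print([hex(u) for u in pair], pair == encoded)
```

Because both units fall in the range reserved in UCS-2, a UCS-2-only system can store them as two opaque 16-bit values and hand them back unchanged.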

Wikipedia ... notes that UTF-8 is a variable-width character set capable of encoding any Unicode data point

Correct, though misleading. Yes, UTF-8 is variable-width, but UTF-16 is also slightly variable, since every Supplementary Character is composed of two 2-byte code units. Hence UTF-16 uses either 2 or 4 bytes per character, whereas UCS-2 always uses 2 bytes. But that is not the misleading part. What is misleading is the implication that other Unicode encodings aren't capable of encoding all code points. While UCS-2 can hold them but not interpret them, UTF-16 and UTF-32 can both map all Unicode code points, just like UTF-8.

and that it [ed: UTF-8] provides the de facto standard encoding for interchange of Unicode text.

This may be true, but it is entirely irrelevant from an operational perspective.

it feels like any Unicode character can be represented in UTF-8

Again, true, but entirely irrelevant since UTF-16 and UTF-32 also map all Unicode code points.

since most text will be English, the representation will be nearly twice as compact as with UCS-2

Depending on circumstances this could very well be true, and you are correct to be concerned about such wasteful usage. However, as I mentioned in the question that led to this one (UTF-8 Support, SQL Server 2012 and the UTF8String UDT), you have a few options to mitigate the amount of space wasted if most rows can fit into VARCHAR yet some need to be NVARCHAR. The best option is to enable ROW COMPRESSION or PAGE COMPRESSION (Enterprise Edition only!). Starting in SQL Server 2008 R2, they allow non-MAX NVARCHAR fields to use the "Standard Compression Scheme for Unicode", which is at least as good as UTF-8 and in some cases even better. NVARCHAR(MAX) fields cannot use this fancy compression, but their IN ROW data can benefit from regular ROW and/or PAGE Compression. Please see the following for a description of this compression and a chart comparing data sizes for: raw UCS-2 / UTF-16, UTF-8, and UCS-2 / UTF-16 with data compression enabled.

SQL Server 2008 R2 - UCS2 compression what is it - Impact on SAP systems

Please also see the MSDN page for Data Compression for more details, as there are some restrictions (beyond it being available only in Enterprise Edition, BUT made available to all editions starting with SQL Server 2016 SP1!) and some circumstances in which compression might make things worse.
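To put rough numbers on the size question, here is a minimal Python sketch (not SQL Server code) that compares the encoded byte counts of the same text in UTF-8 and UTF-16 LE. The sample strings are made up for illustration, and it measures raw encodings only, not what row or page compression would achieve.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16-le": len(text.encode("utf-16-le")),
    }


# Hypothetical sample values, chosen only to illustrate the trend.
samples = {
    "english": "The quick brown fox jumps over the lazy dog",
    "cjk": "寿司と天ぷら",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

Pure ASCII text is exactly half the size in UTF-8, while the CJK sample comes out larger in UTF-8 (3 bytes per character versus 2), so the saving depends on how "mostly English" the data really is.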

I know disk is "cheap"

The veracity of that statement depends on how one defines "disk". If you are speaking in terms of commodity parts that you can purchase off the shelf at a store for use in your desktop / laptop, then sure. But if you are speaking in terms of enterprise-level storage that will be used for your Production systems, then have fun explaining to whoever controls the budget that they shouldn't reject the million-plus-dollar SAN that you want because it is "cheap" ;-).

What problems might I encounter by swimming up the UCS-2 stream?

None that I can think of. Well, as long as you don't follow any horrible advice to do something like implementing that UDT, or converting all of the strings to VARBINARY, or using NVARCHAR(MAX) for all string fields ;-). But of all of the things you could worry about, SQL Server using UCS-2 / UTF-16 shouldn't be one of them.

But, if for some reason this issue of no native support for UTF-8 is super important, then you might need to find another RDBMS to use that does allow for UTF-8.


UPDATE 2018-10-02

While this is not a viable option yet, SQL Server 2019 introduces native support for UTF-8 in VARCHAR / CHAR datatypes. There are currently too many bugs with it for it to be used, but if they are fixed, then this is an option for some scenarios. Please see my post, "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?", for a detailed analysis of this new feature.

遇见了你 2025-01-04 20:16:26

What do you mean by "swimming up the UCS-2 stream"?

Here are your options:

Not recommended but possible:

  • Implement a UDT. This will be a lot of work and you will lose tooling support (O/R mapping and certainly some SQL Server features that work on native types).
  • Use varbinary(max): Requires you to do custom conversion code. No range indexing.
  • Use nvarchar(N) and turn on row compression. Starting with SQL Server 2008 R2 this will use an encoding that is as compact as UTF-8. But this requires Enterprise Edition.

See the comments to read about the severe drawbacks that these approaches have.
