What are the consequences of storing a C# string (UTF-16) in a SQL Server nvarchar (UCS-2) column?
It seems that SQL Server uses Unicode UCS-2, a 2-byte fixed-length character encoding, for nchar/nvarchar fields. Meanwhile, C# uses Unicode UTF-16 encoding for its strings. (Note: some people don't consider UCS-2 to be Unicode, but it encodes all the same code points as UTF-16 in the range 0-0xFFFF, and as far as SQL Server is concerned, that's the closest thing to "Unicode" it natively supports for character strings.)
While UCS-2 encodes the same code points as UTF-16 in the Basic Multilingual Plane (BMP), it doesn't reserve the bit patterns that UTF-16 sets aside for surrogate pairs.
If I write a C# string to a SQL Server nvarchar (UCS-2) field and read it back, will this always return the same result?
It seems that while UTF-16 is a superset of UCS-2 in the sense that UTF-16 encodes more code points (i.e. those above 0xFFFF), it's actually a subset of UCS-2 at the 2-byte level, since it's more restrictive about which sequences of 2-byte values are valid.
To answer my own question, I suspect that if my C# string contains code points above 0xFFFF (represented by pairs of characters), these would be stored and retrieved just fine in the database, but if I tried to manipulate them in the database (e.g. by calling UPPER or attempting to blank out every other character), then I could run into problems displaying the string later... unless SQL Server has functions that acknowledge surrogate pairs and effectively treat nchar/nvarchar strings as UTF-16.
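To make the round-trip part of the question concrete, here is a minimal sketch in Python (used only because it makes the 16-bit code units easy to inspect; it does not talk to SQL Server or .NET). It assumes the column keeps every 2-byte unit exactly as written, which is an assumption of the sketch rather than something verified here. The helper names to_utf16_units and from_utf16_units are hypothetical.

```python
import struct

def to_utf16_units(text):
    """Return the 16-bit UTF-16 code units that encode text."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def from_utf16_units(units):
    """Decode a sequence of 16-bit code units back into a string."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")

# U+1F600 lies above 0xFFFF, so UTF-16 encodes it as a surrogate pair.
original = "A\U0001F600"
units = to_utf16_units(original)

# Assumption: the store keeps every 2-byte unit exactly as written.
stored = list(units)

print([hex(u) for u in units])               # code units, surrogates included
print(from_utf16_units(stored) == original)  # whole-string round trip
```

The second and third units fall in the surrogate range 0xD800-0xDFFF. A store that never looks inside the units hands the same sequence back, so the whole-string round trip is lossless under that assumption.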
Comments (2)
It's all a bit of a fudge really.
First, the similarities:
The SQL Server nchar/nvarchar/ntext data types store text as a string of 2-byte characters. SQL Server doesn't really care what you put in them until you come to do searching and sorting (then it uses the appropriate Unicode collation sequence).
The CLR String data type also stores text as a string of 2-byte Chars. It also doesn't really care what you put in it until you come to do searching and sorting (then it uses the appropriate culture-specific methods).
Now, the differences:
[...] String, it will always encode the string as UTF-16 (with full multilingual plane support).
In short, as long as you treat both CLR and SQL Server string variables as whole blobs of text, you can freely assign from one to the other with no loss of information. The underlying storage format is exactly the same, even though the abstractions layered on top are slightly different.
I don't expect that treating the text as UCS-2 would cause many problems.
Case conversions should not be a problem, because (AFAIK) there are no case mappings above the BMP (except the identity mapping, of course!), and, obviously, the surrogate characters are going to map to themselves.
Blanking every other character is just asking for trouble. In reality, doing these sorts of transformations without consideration of the character values is always a dangerous activity. I can see it happening legitimately with string truncations. But if any unmatched surrogates show up in the result, this itself is not a huge problem. Any system that receives such data—and cares—will probably just replace the unmatched surrogate with a replacement character, if it bothers to do anything about it at all.
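The truncation case can be sketched as follows. This is an illustration in Python, not how SQL Server or any particular client behaves: the string is cut at a fixed number of 16-bit units, and a hypothetical lenient decoder (decode_units_leniently, written for this example) substitutes U+FFFD for any surrogate left without its partner.

```python
import struct

REPLACEMENT = "\uFFFD"

def utf16_units(text):
    """Return the 16-bit UTF-16 code units that encode text."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def decode_units_leniently(units):
    """Decode UTF-16 units, replacing unpaired surrogates with U+FFFD."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= u <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Valid high/low pair: combine into one supplementary code point.
            out.append(chr(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Surrogate without a partner: substitute the replacement character.
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(chr(u))
            i += 1
    return "".join(out)

text = "ab\U00012000"          # U+12000 is a cuneiform sign (two code units)
truncated = utf16_units(text)[:3]  # cut after 3 units, splitting the pair

print(decode_units_leniently(truncated) == "ab" + REPLACEMENT)
```

Only the split character is damaged; everything before the cut survives unchanged.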
Obviously, string length is going to be bytes/2 rather than number-of-characters, but number-of-characters is not a very useful value anyway, once you start plumbing the depths of the Unicode code charts. For example, you aren't going to get good results in monospaced display once you leave the ASCII range, because of combining characters, RTL languages, directional control characters, tags, and several kinds of space characters. The high code points are going to be the least of your problems.
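A small sketch of the length point, again in Python and purely illustrative: it compares the number of 16-bit units (bytes/2) with the number of code points for three sample strings chosen for this example. Neither number is the count of characters a reader would perceive.

```python
def utf16_length(text):
    """Length in 16-bit code units, i.e. byte length / 2."""
    return len(text.encode("utf-16-le")) // 2

samples = {
    "plain ASCII": "abc",
    "e + combining acute": "e\u0301",   # one visible character, two code points
    "cuneiform sign": "\U00012000",     # one code point, two code units
}

for label, s in samples.items():
    # len(s) counts code points in Python 3
    print(label, "units:", utf16_length(s), "code points:", len(s))
```

The combining sequence shows that even a code-point count overstates the number of visible characters, which is why the surrogate pairs are arguably the smaller problem.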
Just to be on the safe side, you should probably store your cuneiform texts in a different column than the archaeologists' names. :D
UPDATE now with empirical data!
I just ran a test to see what happens with case transformations. I created a string with the English word TEST in uppercase twice—first in Latin script, then in Deseret script. I applied a lower-case transformation to this string in .NET and in SQL Server.
The .NET version correctly lowercased all the letters in both scripts. The SQL Server version only lowercased the Latin characters and left the Deseret characters unchanged. This meets expectations regarding the handling of UTF-16 versus UCS-2.
Output:
Just in case anyone has a Deseret font installed, here are the actual strings for your enjoyment:
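The difference described in the update can be modelled with a short Python sketch. This is not the original test and does not run against .NET or SQL Server. The function lower_per_code_unit is a hypothetical stand-in for a UCS-2-style view that lowercases each 16-bit unit on its own, and the Deseret letters are picked for illustration (capitals occupy U+10400-U+10427, with lowercase forms 0x28 higher); they are not necessarily the string used in the test above.

```python
import struct

def lower_per_code_unit(text):
    """Lowercase each UTF-16 code unit independently (a UCS-2-style view)."""
    raw = text.encode("utf-16-le")
    units = struct.unpack("<%dH" % (len(raw) // 2), raw)
    lowered = []
    for u in units:
        if 0xD800 <= u <= 0xDFFF:
            lowered.append(u)  # a lone surrogate has no case mapping
        else:
            low = chr(u).lower()
            # Keep the unit as-is if its lowercase form is not a single unit.
            lowered.append(ord(low) if len(low) == 1 else u)
    return struct.pack("<%dH" % len(lowered), *lowered).decode("utf-16-le")

deseret_upper = "\U00010413\U00010407\U0001041D\U00010413"  # illustrative capitals
sample = "TEST" + deseret_upper

aware = sample.lower()                 # code-point-aware lowercasing
per_unit = lower_per_code_unit(sample)

print(per_unit == "test" + deseret_upper)  # Deseret part left unchanged
print(aware != per_unit)                   # the two views disagree
```

In this model the per-unit view lowercases only the Latin letters, because every Deseret letter is a surrogate pair and neither half has a case mapping, while the code-point-aware call lowercases both scripts.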