Storing UTF-16/Unicode data in SQL Server

Posted on 2024-07-18 10:57:37

According to this, SQL Server 2K5 uses UCS-2 internally. It can store UTF-16 data in UCS-2 (given appropriate data types such as nchar), but a supplementary character is stored as 2 UCS-2 characters, i.e. as a surrogate pair.
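To make the "stored as 2 UCS-2 characters" point concrete, here is a minimal Python sketch (not SQL Server code) of how UTF-16 splits a supplementary code point into a surrogate pair. The function name `to_surrogate_pair` is made up for illustration; the arithmetic is the standard UTF-16 encoding rule.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)  # U+1F600 lies outside the BMP
print(f"{high:04X} {low:04X}")          # D83D DE00

# Cross-check against Python's own UTF-16 encoder.
encoded = chr(0x1F600).encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Each half on its own is a valid 16-bit value, which is why a UCS-2-only engine can store the pair without complaint while still seeing two "characters".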

This brings the obvious issue with the string functions: what is really one character is treated as 2 by SQL Server.
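The effect can be imitated outside the database. The sketch below is plain Python, not T-SQL, and the helpers `ucs2_len` and `ucs2_left` are hypothetical stand-ins for string functions that count 16-bit units, which is the behaviour described above; it is not a statement about SQL Server's actual implementation.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def ucs2_len(text):
    """Length as a UCS-2-only function would count it: one per 16-bit unit."""
    return len(utf16_units(text))


def ucs2_left(text, count):
    """First `count` 16-bit units, even if that cuts a surrogate pair in half."""
    return utf16_units(text)[:count]


s = "a\U00020000b"    # 'a', U+20000 (a supplementary CJK ideograph), 'b'
print(len(s))          # 3 code points
print(ucs2_len(s))     # 4 code units
print([hex(u) for u in ucs2_left(s, 2)])   # ['0x61', '0xd840'], a lone high surrogate
```

So both the count and any position-based slicing are off as soon as one supplementary character is present.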

I am somewhat surprised that SQL Server is basically only able to handle UCS-2, and even more so that this is not fixed in SQL 2K8. I do appreciate that some of these characters may not be all that common.

Aside from the functions suggested in the article, are there any suggestions on the best approach for dealing with the (broken) string functions and UTF-16 data in SQL Server 2K5?

Comments (3)

小姐丶请自重 2024-07-25 10:57:37

SQL Server 2012 now supports UTF-16 including surrogate pairs. See http://msdn.microsoft.com/en-us/library/ms143726(v=sql.110).aspx, especially the section "Supplementary characters".

So one fix for the original problem is to adopt SQL Server 2012.

鹊巢 2024-07-25 10:57:37

The string functions work fine with Unicode character strings; the ones that care about the number of characters treat a two-byte character as a single character, not two. The only ones to watch for are len() and datalength(), which return different values for Unicode data. Both are correct, of course: len() returns the length in characters, and datalength() returns the length in bytes. They just happen to differ because each character takes two bytes.
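A rough Python analogy of that characters-versus-bytes distinction, assuming the data is held as UTF-16 with two bytes per 16-bit unit. `len_like` and `datalength_like` are invented names that only imitate the counting; they are not the T-SQL functions themselves.

```python
def datalength_like(text):
    """Bytes needed to hold text as UTF-16, two bytes per 16-bit code unit."""
    return len(text.encode("utf-16-le"))


def len_like(text):
    """Number of 16-bit code units, i.e. what a two-byte character count sees."""
    return datalength_like(text) // 2


# The last sample is U+1D11E, which lies outside the BMP.
for sample in ("hello", "\u043f\u0440\u0438\u0432\u0435\u0442", "\U0001D11E"):
    print(ascii(sample), len_like(sample), datalength_like(sample))
```

For the first two samples the character count matches what a reader would expect (5 and 6). For the last one the unit count is 2 although it is a single character, which is the case the question is about.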

So, as long as you use the proper functions in your code, everything should work transparently.

EDIT: Just double-checked Books Online; Unicode data has worked seamlessly with string functions since SQL Server 2000.

EDIT 2: As pointed out in the comments, SQL Server's string functions do not support the full Unicode character set, because they do not parse the surrogate pairs used for characters outside plane 0 (in other words, they only recognize up to 2 bytes per character). SQL Server will store and return the data correctly, but any string function that relies on character counts will not return the expected values. The most common workarounds seem to be processing the string outside SQL Server or using CLR integration to add Unicode-aware string processing functions.
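As a sketch of the "process it outside SQL Server" route, the Python below works on code points instead of 16-bit units. It assumes the value reaches the client as raw UTF-16LE bytes; the helper names (`decode_nvarchar`, `char_len`, `char_left`) are hypothetical, and most drivers would hand you an already-decoded string instead.

```python
def decode_nvarchar(raw):
    """Decode raw UTF-16LE bytes into a Python str of code points."""
    return raw.decode("utf-16-le")


def char_len(raw):
    """Count code points, treating a surrogate pair as one character."""
    return len(decode_nvarchar(raw))


def char_left(raw, count):
    """First `count` code points, re-encoded as UTF-16LE; never splits a pair."""
    return decode_nvarchar(raw)[:count].encode("utf-16-le")


raw = "a\U00020000b".encode("utf-16-le")   # 8 bytes, 4 code units, 3 characters
print(char_len(raw))                         # 3
print(char_left(raw, 2).hex())               # 610040d800dc
```

Note that this counts code points, not user-perceived characters; combining sequences would still need separate handling.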

魂归处 2024-07-25 10:57:37

Something to add that I just learned the hard way:

If you use an "n" field in Oracle (I'm running 9i) and access it via the .NET OracleClient, it seems that only parameterized SQL will work... the N'string' Unicode prefix doesn't seem to do the trick if you have some inline SQL.

And by "work", I mean: it will lose any characters not supported by the base charset. So in my case, English chars work fine, while Cyrillic turns into question marks/garbage.
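The kind of loss described here can be reproduced without a database. This is a generic Python illustration, not Oracle or .NET code, and Latin-1 is only an assumed stand-in for "the base charset"; the real one depends on the database configuration.

```python
text = "Hello \u041f\u0440\u0438\u0432\u0435\u0442"   # "Hello" plus a Cyrillic word

# Assumption: the base (non-national) charset is Latin-1; yours may differ.
# Characters it cannot represent are replaced with '?'.
lossy = text.encode("latin-1", errors="replace").decode("latin-1")
print(lossy)   # Hello ??????

# A Unicode-capable encoding round-trips the same text without loss.
safe = text.encode("utf-16-le").decode("utf-16-le")
assert safe == text
```

Once the text has passed through the narrower charset the original characters are gone, which would explain why the conversion has to be avoided on the way in rather than repaired afterwards.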

There is a fuller discussion on the subject here: http://forums.oracle.com/forums/thread.jspa?threadID=376847

I wonder if the ORA_NCHAR_LITERAL_REPLACE variable can be set in the connection string or something.
