How do I store UTF-16 characters in a Postgres database?

Posted 2024-12-20 13:34:47

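I am trying to store some text (e.g. č) in a Postgres database, but when I retrieve the value it appears on screen as ?. I'm not sure why it does this. I was under the impression that it was a character that isn't supported in UTF-8 but is in UTF-16; judging by the first answer, however, that was an incorrect assumption.

Original question (which may still be valid):

I have read about UTF-8 surrogate pairs, which may achieve what I require, and I have seen a few examples involving the StringInfo object and TextElementEnumerators, but I couldn't work out a practical proof of concept.
Can someone provide an example of how to write and read UTF-16 (probably using this surrogate pair concept) to a Postgres database? Thank you.

Updated question: why would the č character be returned from the database as a question mark?

We are using Npgsql to access the database, and VB.Net.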


裂开嘴轻声笑有多痛 2024-12-27 13:34:47

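There are no characters which exist in UTF-16 but not in UTF-8. Both are capable of encoding all of Unicode. In other words, if you can get UTF-8 to work, it should be able to store any valid Unicode text.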

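EDIT: Surrogate pairs are actually a feature of UTF-16 rather than UTF-8. They allow characters which aren't in the Basic Multilingual Plane (BMP) to be represented as two UTF-16 code units. Basically, UTF-16 is often treated as a fixed-width encoding (exactly two bytes per Unicode character), but that only allows the BMP to be encoded cleanly. Surrogate pairs are a (rather hacky) way of extending the range beyond the BMP.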
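
To make the surrogate pair mechanism concrete, here is a minimal Python sketch. Python is used purely for illustration (the question is about VB.Net), and the helper name to_surrogate_pair is made up for this example. It splits a code point above the BMP into its two UTF-16 code units and compares the result with Python's own UTF-16 encoder.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above the BMP into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go into the low surrogate
    return high, low


emoji = "\U0001F600"                       # a character outside the BMP
high, low = to_surrogate_pair(ord(emoji))
print(f"U+{ord(emoji):X} -> {high:04X} {low:04X}")

# The character from the question is inside the BMP: one UTF-16 code unit.
c_caron_units = len("č".encode("utf-16-le")) // 2
print("UTF-16 code units for č:", c_caron_units)
```

A BMP character such as č never involves surrogates, which is why the rest of this answer looks elsewhere for the cause.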

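I very much doubt that the character you're trying to represent is outside the BMP, so I suspect you need to look elsewhere for the problem. In particular, it's worth dumping the exact character values of the text (e.g. casting each char to int) before it goes into the database and after you've fetched it. Ideally, do this in a short but complete console app.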
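
A rough sketch of that kind of dump, written in Python rather than VB.Net for brevity. The helper dump_code_points is a hypothetical name, and the lossy Latin-1 step is only an assumed stand-in for whatever conversion might be going wrong; it is not a diagnosis of the Npgsql setup.

```python
def dump_code_points(text: str) -> str:
    """Render every character as U+XXXX so nothing depends on fonts or consoles."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


original = "č"
print("before:", dump_code_points(original))

# ASSUMPTION: simulate a lossy conversion to a charset without č (Latin-1 here).
# With errors="replace" the encoder substitutes "?" for the unmappable character.
damaged = original.encode("latin-1", errors="replace").decode("latin-1")
print("after: ", dump_code_points(damaged))
```

If the value dumped after the round trip is U+003F, the character was replaced during a conversion; if it is still U+010D, the data is intact and only the display is at fault.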


￠好甜 2024-12-27 13:34:47

How can I store all UTF-16 "characters" in a Postgres database?

Short answer: this is not directly possible, because the only Unicode server encoding PostgreSQL supports is UTF-8.

Strings on UTF-16 based platforms such as Java, JavaScript and Windows can contain unpaired surrogates (one half of a surrogate pair), which have no representation in UTF-8 or UTF-32. These are easily created by taking a substring of a Java, JavaScript or VB.Net string. Because they cannot be represented in UTF-8 or UTF-32, they cannot be stored in a database that only offers UTF-8 for Unicode text, such as PostgreSQL.
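
The effect can be reproduced in a few lines of Python (used here only for illustration). Python strings are not UTF-16, so the unpaired surrogate is written as a literal that stands in for what a one-unit substring of a UTF-16 string would contain.

```python
emoji = "\U0001F600"

# As UTF-16 the character occupies two code units (four bytes).
utf16 = emoji.encode("utf-16-le")
print(utf16.hex(" "))

# Stand-in for the first UTF-16 code unit taken on its own, e.g. by a substring call.
lone_high = "\ud83d"

# A strict UTF-8 encoder refuses an unpaired surrogate.
try:
    lone_high.encode("utf-8")
    encodable = True
except UnicodeEncodeError as exc:
    encodable = False
    print("cannot encode:", exc.reason)

# The complete character, by contrast, encodes without trouble.
utf8 = emoji.encode("utf-8")
print(utf8.hex(" "))
```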

Windows path names may contain unpaired surrogates, which cannot be read as UTF-8 ( https://github.com/rust-lang/rust/issues/12056 ).

One would have to use a database system that supports a UTF-16/CESU-8 character set, which is better adapted to Java/Android, JavaScript/NodeJS and .Net/wchar_t/Windows languages and platforms (SQLServer, Oracle (UTF-8 collation), DB2, Informix, HANA, SQL Anywhere and MaxDB typically support such a charset).

Note that with emoticons being represented as Unicode code points outside the Basic Multilingual Plane, these differences will become more relevant for western users as well.

On Postgres you may:
a) accept the losses,
b) store the data as binary data, or
c) translate it to an encoded representation (e.g. the JSON RFC escapes a character outside the BMP as two \uXXXX sequences, one per surrogate, so that even unpaired surrogates can be transported within a UTF-8/ASCII based network format without loss; see https://www.rfc-editor.org/rfc/rfc4627 Section 2.5).
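
Options b) and c) can be sketched as follows, in Python for illustration only. The sample value is invented, the database itself is left out, and storing the bytes in a binary column or the escaped text in a text column is an assumption about how one might use the result.

```python
import json

# An invented value that contains an unpaired high surrogate.
value = "hi \ud83d!"

# b) Binary: keep the raw UTF-16 code units. "surrogatepass" tells Python's
#    codec to let unpaired surrogates through instead of raising an error.
blob = value.encode("utf-16-le", errors="surrogatepass")
restored_from_blob = blob.decode("utf-16-le", errors="surrogatepass")

# c) Escaped text: json.dumps (ensure_ascii is on by default) writes every
#    non-ASCII code unit as a \uXXXX escape, so the result is plain ASCII.
escaped = json.dumps(value)
restored_from_json = json.loads(escaped)

print(blob.hex(" "))
print(escaped)
```

Both forms survive a round trip, at the cost that the database can no longer search or collate the content as ordinary text.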

Depending on the choice of application server language (Java, Scala, C#/Windows, JavaScript/NodeJS versus Go) and the level of investment in language support (for example using ICU string splitting functions at grapheme boundaries (https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) instead of simple truncation), the issue may be less relevant. But the majority of enterprise systems and languages fall into the UTF-16 camp today, with software using simple substring operations.


动次打次papapa 2024-12-27 13:34:47


As to the problem of storing/retrieving č:

Check that the character set the PostgreSQL database is running on is UTF-8 (https://www.postgresql.org/docs/9.1/multibyte.html) or another character set that can represent the character.
Check that the client connection to the database is set up to perform the appropriate code page conversion (for VB.Net this would be from UTF-16LE to UTF-8 or the database charset); this is typically a parameter on the connection string (charset).
Check that the input is an actual UTF-8 / UTF-16 (in VB.Net) byte sequence, not a Windows-1250 byte sequence.
Check that this is not simply a limitation of the output tool or console (e.g. a Windows console typically does not display Unicode characters but uses a Windows-12xx character set; one can try https://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8), but typically inspecting the byte sequence in the VB.Net debugger is best.
Check that the length of the CHAR/VARCHAR column is sufficient to store your representation, even if it is in NFKD decomposition.
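
To illustrate why the byte-sequence check matters, here is a small Python sketch (Python only for illustration) of what happens when bytes in one encoding are interpreted as another. Which mismatch, if any, applies to your setup is an open question; the two directions below are just the obvious candidates.

```python
text = "č"

utf8_bytes = text.encode("utf-8")        # two bytes
cp1250_bytes = text.encode("cp1250")     # one byte in Windows-1250

# Windows-1250 bytes handed to something that expects UTF-8: the byte is not
# valid UTF-8 on its own, so a lenient decoder substitutes U+FFFD.
misread_as_utf8 = cp1250_bytes.decode("utf-8", errors="replace")

# UTF-8 bytes handed to something that expects Windows-1250: each byte is
# decoded as a separate character, giving two unrelated letters.
misread_as_cp1250 = utf8_bytes.decode("cp1250")

print(utf8_bytes.hex(" "), cp1250_bytes.hex(" "))
print([f"U+{ord(ch):04X}" for ch in misread_as_utf8])
print([f"U+{ord(ch):04X}" for ch in misread_as_cp1250])
```

Comparing the bytes actually sent with these patterns usually shows quickly which side of the connection is using the wrong encoding.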

The grapheme you indicate has several different Unicode representations:

 U+010D LATIN SMALL LETTER C WITH CARON
 U+0063 LATIN SMALL LETTER C followed by U+030C COMBINING CARON

It also has different representations in other character sets (e.g. 0xE8 in ISO-8859-2/Windows-1250 (https://en.wikipedia.org/wiki/Windows-1250) or ISO-8859-13/Windows-1257).
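
The two Unicode forms and the legacy byte value can be compared with Python's standard unicodedata module (again only an illustrative sketch, not VB.Net code):

```python
import unicodedata

precomposed = "\u010d"        # LATIN SMALL LETTER C WITH CARON
decomposed = "c\u030c"        # c followed by COMBINING CARON

# Normalization converts between the two forms.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

# The decomposed form needs more room: two code points and three UTF-8 bytes,
# against one code point and two UTF-8 bytes for the precomposed form.
sizes = {
    "precomposed": (len(precomposed), len(precomposed.encode("utf-8"))),
    "decomposed": (len(decomposed), len(decomposed.encode("utf-8"))),
}
print(sizes)

# In the legacy single-byte charsets the letter is one byte.
legacy = {name: precomposed.encode(name) for name in ("iso8859_2", "cp1250")}
print(legacy)
```

The two forms render identically but do not compare equal as strings unless both sides are normalized the same way, and the decomposed form is the one to size a length-limited column for.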

All Unicode representations fall into the Basic Multilingual Plane, so the UTF-16 surrogate issue with PostgreSQL, as indicated in the question title and answered elsewhere on this page, is likely irrelevant to your problem.
