How can I store UTF-16 characters in a Postgres database?

Posted on 2024-12-20 13:34:47 | length 509 | views 7 | comments 0

I am trying to store some text (e.g. č) in a Postgres database, but when I retrieve the value it appears on screen as ?. I'm not sure why. I was under the impression that č was a character supported in UTF-16 but not in UTF-8; judging by the first answer, however, that assumption is incorrect.

Original question (which may still be valid):

I have read about UTF-8 surrogate pairs, which may achieve what I
require, and I've seen a few examples involving the StringInfo
class and TextElementEnumerator objects, but I couldn't work out a
practical proof of concept.

Can someone provide an example of how you would write and read UTF-16
(probably using this surrogate pair concept) to a Postgres database?
Thank you.

Updated question:
Why would the č character be returned from the database as a question mark?

We use Npgsql with VB.Net to access the database.

Comments (3)

裂开嘴轻声笑有多痛 2024-12-27 13:34:47

There's no such thing as a character which exists in UTF-16 but not UTF-8. Both are capable of encoding all of Unicode. In other words, if you can get UTF-8 to work, it should be able to store any valid Unicode text.

EDIT: Surrogate pairs are actually a feature of UTF-16 rather than UTF-8. They allow a character that isn't in the Basic Multilingual Plane (BMP) to be represented as two UTF-16 code units. Basically, UTF-16 is often treated as a fixed-width encoding (exactly two bytes per Unicode character), but that only allows the BMP to be encoded cleanly. Surrogate pairs are a (fairly hacky) way of extending the range beyond the BMP.
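To make the code-unit distinction concrete, here is a minimal Python sketch (Python rather than the asker's VB.Net, purely for illustration). The helper name `utf16_code_units` is hypothetical. It shows that č fits in a single UTF-16 code unit, while a character outside the BMP needs a surrogate pair, and that both encode fine as UTF-8.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    # Every UTF-16 code unit is two bytes; read them little-endian.
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


c_caron = "\u010d"      # č, inside the BMP
emoji = "\U0001F600"    # a character outside the BMP

print([hex(u) for u in utf16_code_units(c_caron)])  # one code unit
print([hex(u) for u in utf16_code_units(emoji)])    # two code units (a surrogate pair)

# Both characters also have a valid UTF-8 encoding.
print(c_caron.encode("utf-8").hex(), emoji.encode("utf-8").hex())
```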

I very much doubt that the character you're trying to represent is outside the BMP, so I suspect you need to look elsewhere for the problem. In particular, it's worth dumping the exact character values of the text (e.g. by casting each char to int) before it goes into the database and after you've fetched it. Ideally, do this in a short but complete console app.
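The dump suggested above could look roughly like this. It is sketched in Python for brevity; in VB.Net the same idea would be a loop over the string's characters. The function name `dump_code_points` and the sample "fetched" value are assumptions for illustration, not output from a real database.

```python
def dump_code_points(text):
    """Render each character of text as a U+XXXX code point."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


before_insert = "\u010d"   # the value handed to the database layer
after_fetch = "?"          # hypothetical value read back, as described in the question

print("before:", dump_code_points(before_insert))
print("after: ", dump_code_points(after_fetch))
```

If the two dumps differ, the character was changed somewhere between the application and the database; if they match, the problem is more likely in how the value is displayed.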

¢好甜 2024-12-27 13:34:47

How can I store all UTF-16 "characters" in a Postgres database?

Short answer: this is not directly possible, because PostgreSQL has no UTF-16 server encoding; its Unicode server encoding is UTF-8.

UTF-16 based string types, such as those of Java, JavaScript, and Windows, can contain unpaired (half) surrogates, which have no representation in UTF-8 or UTF-32. These are easily created by taking a substring of a Java, JavaScript, or VB.Net string. Because they cannot be represented in UTF-8 or UTF-32, they cannot be stored as text in a database whose Unicode character set is UTF-8, such as PostgreSQL.
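A small Python sketch of the half-surrogate problem follows. Python strings are sequences of code points rather than UTF-16 code units, so the hypothetical helper `utf16_prefix` imitates a UTF-16 style substring by cutting the encoded bytes; `is_encodable` is likewise an illustrative name.

```python
def utf16_prefix(text, n_units):
    """Imitate a UTF-16 substring: keep only the first n_units code units."""
    data = text.encode("utf-16-le")[: 2 * n_units]
    # surrogatepass lets an unpaired surrogate survive decoding.
    return data.decode("utf-16-le", errors="surrogatepass")


def is_encodable(text, codec):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


emoji = "\U0001F600"            # needs a surrogate pair in UTF-16
half = utf16_prefix(emoji, 1)   # cut in the middle of the pair

print(is_encodable(emoji, "utf-8"))   # the full character is fine
print(is_encodable(half, "utf-8"))    # the unpaired surrogate is not
print(is_encodable(half, "utf-32"))   # nor is it in UTF-32
```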

Windows path names may contain unpaired surrogates, which cannot be read as UTF-8 ( https://github.com/rust-lang/rust/issues/12056 ).

One would have to use a database system that supports a UTF-16/CESU-8 character set, which is better adapted to Java/Android, JavaScript/NodeJS, and .Net/wchar_t/Windows languages and platforms.
(SQL Server, Oracle (UTF-8 collation), DB2, Informix, HANA, SQL Anywhere, and MaxDB typically support such a character set.)

Note that with emoticons being represented as Unicode code points outside the Basic Multilingual Plane, these differences will become more relevant for western users as well.

On Postgres you may:
a) accept the losses,
b) store the data as binary data,
or
c) translate it to an
encoded representation (e.g. the JSON RFC escapes a character outside the BMP as two \uXXXX escapes, one per surrogate, which lets half surrogates be transported without loss in a UTF-8/ASCII-based network format; see https://www.rfc-editor.org/rfc/rfc4627 Section 2.5).
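Options b) and c) can be sketched as follows, using only the Python standard library. The variable names are illustrative, and writing the resulting bytes or ASCII text to an actual PostgreSQL column (for example a `bytea` or `text` column) is left out; that part is an assumption about how the values would be used.

```python
import json

half = "\ud83d"  # an unpaired high surrogate, as produced by a bad substring

# Option b: keep the raw UTF-16 code units as binary data.
as_binary = half.encode("utf-16-le", errors="surrogatepass")
restored_from_binary = as_binary.decode("utf-16-le", errors="surrogatepass")

# Option c: JSON-style \uXXXX escaping yields plain ASCII text,
# which any UTF-8 text column can hold.
as_json = json.dumps(half)
restored_from_json = json.loads(as_json)

print(as_binary.hex())
print(as_json)
print(restored_from_binary == half, restored_from_json == half)
```

Either way the application, not the database, is responsible for decoding the value again, and the stored form is no longer directly searchable as text.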

Depending on the choice of application server language (Java, Scala, C#/Windows, JavaScript/NodeJS versus Go) and the level of investment in language support (e.g. using ICU string-splitting functions at grapheme boundaries (https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) instead of simple truncation), the issue may be less relevant. But the majority of enterprise systems and languages fall into the UTF-16 camp today, with software using simple substring operations.

动次打次papapa 2024-12-27 13:34:47

As to the problem of storing/retrieving č:

  1. Check that the PostgreSQL database uses the UTF-8 character set
    (https://www.postgresql.org/docs/9.1/multibyte.html ) or another
    character set that can represent the character.

  2. Check that the client connection to the database is set up to
    perform the appropriate code page conversion (for VB.Net this would
    be from UTF-16LE to UTF-8 or to the database character set; this is
    typically a parameter on the connection string (charset)).

  3. Check that the input in VB.Net really is the UTF-8 / UTF-16 byte sequence, not the Windows-1250 byte sequence.

  4. Check that this is not simply a limitation of the output tool or
    console. A Windows console typically does not display Unicode characters but uses a Windows-12xx character set (one can try https://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8), so inspecting the byte sequence in a VB.Net debugger is usually best.

  5. Check that the length of the CHAR/VARCHAR column is sufficient to store your representation, even in NFKD decomposition.
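One way a literal ? can arise is a lossy conversion somewhere along the path covered by checks 2 to 4: a character set that cannot represent č substitutes a replacement character. The sketch below only demonstrates that mechanism in Python; it is not a diagnosis of the asker's setup, and `lossy_roundtrip` is a hypothetical helper name.

```python
def lossy_roundtrip(text, codec):
    """Encode text with codec, replacing unencodable characters, then decode it."""
    return text.encode(codec, errors="replace").decode(codec)


c_caron = "\u010d"  # č

# Character sets without č turn it into a question mark.
print(lossy_roundtrip(c_caron, "ascii"))
print(lossy_roundtrip(c_caron, "cp1252"))

# Character sets that contain č preserve it.
print(lossy_roundtrip(c_caron, "utf-8") == c_caron)
print(lossy_roundtrip(c_caron, "cp1250") == c_caron)
```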

The grapheme you indicate has more than one Unicode representation:

 U+010D LATIN SMALL LETTER C WITH CARON
 U+0063 LATIN SMALL LETTER C followed by U+030C COMBINING CARON

It also has different representations in other character sets (e.g. 0xE8 in
ISO-8859-2/Windows-1250 (https://en.wikipedia.org/wiki/Windows-1250) or
ISO-8859-13/Windows-1257).
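These representations can be inspected with Python's standard `unicodedata` module, as in the sketch below. It also shows why the column length matters: the decomposed form is longer in both characters and UTF-8 bytes. The codec names are Python's spellings of the character sets mentioned above.

```python
import unicodedata

composed = "\u010d"                                    # U+010D
decomposed = unicodedata.normalize("NFD", composed)    # U+0063 U+030C

for label, text in (("composed", composed), ("decomposed", decomposed)):
    points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    utf8_len = len(text.encode("utf-8"))
    print(f"{label}: {points}, {len(text)} code point(s), {utf8_len} UTF-8 byte(s)")

# NFC recombines the pair into the single precomposed character.
print(unicodedata.normalize("NFC", decomposed) == composed)

# Single-byte legacy character sets that contain the character.
for codec in ("iso8859_2", "cp1250", "iso8859_13", "cp1257"):
    print(codec, composed.encode(codec).hex())
```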

All of these Unicode representations fall within the Basic Multilingual Plane, so the UTF-16 surrogate issue with PostgreSQL raised in the question title and addressed in the other answers is likely irrelevant to your problem.
