PostgreSQL 9.1: using COLLATE in a SELECT statement

Posted 2024-12-10 15:08:43

I have a table in a PostgreSQL 9.1 database whose locale is "en_US.UTF-8":

CREATE TABLE branch_language
(
    id serial NOT NULL,
    name_language character varying(128) NOT NULL,
    branch_id integer NOT NULL,
    language_id integer NOT NULL,
    ....
)

The attribute name_language contains names in various languages. The language is specified by the foreign key language_id.

I have created a few indexes:

/* us english */
CREATE INDEX idx_branch_language_2
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."en_US" );

/* catalan */
CREATE INDEX idx_branch_language_5
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."ca_ES" );

/* portuguese */
CREATE INDEX idx_branch_language_6
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."pt_PT" );

Now when I do a select I am not getting the results I am expecting.

select name_language from branch_language
where language_id=42 -- id of catalan language
order by name_language collate "ca_ES" -- use ca_ES collation

This generates a list of names but not in the order I expected:

Aficions i Joguines
Agència de viatges
Aliments i Subministraments
Aparells elèctrics i il luminació
Art i Antiguitats
Articles de la llar
Bars i Restaurants
...
Tabac
Àudio, Vídeo, CD i DVD
Òptica

I expected the last two entries to appear in different positions in the list.

Creating the indexes works. I don't think they are really necessary unless you want to optimize for performance.

The select statement however seems to ignore the part: collate "ca_ES".

This problem also exists when I select other collations. I have tried "es_ES" and "pt_PT" but the results are similar.

笑梦风尘 2024-12-17 15:08:43

I can't find a flaw in your design. I have tried.

Locales and collation

I revisited this question. Consider this test case on sqlfiddle. It seems to work just fine. I even created the locale ca_ES.utf8 in my local test server (PostgreSQL 9.1.6 on Debian Squeeze) and added the locale to my DB cluster:

CREATE COLLATION "ca_ES" (LOCALE = 'ca_ES.utf8');

I get the same results as can be seen in the sqlfiddle above.

Note that collation names are identifiers and need to be double-quoted to preserve mixed-case spelling like "ca_ES". Maybe there has been some confusion with other locales in your system? Check your available collations:

SELECT * FROM pg_collation;

Generally, collation rules are derived from system locales. Read about the details in the manual here. If you still get incorrect results, I would try to update your system and regenerate the locale for "ca_ES". In Debian (and related Linux distributions) this can be done with:

dpkg-reconfigure locales
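
One way to read the symptom: for the rows shown, the order in the question matches what a plain code point comparison gives, with accented capitals landing after 'Z'. That would be consistent with the Catalan rules not being applied at all. The sketch below is Python rather than PostgreSQL and only illustrates that ordering; it assumes the accented letters are stored as single precomposed characters, and the list is a subset of the names from the question.

```python
# A subset of the names from the question, deliberately shuffled.
# Accented capitals are written as escapes so they are certainly precomposed.
names = [
    "Tabac",
    "\u00d2ptica",                     # 'Òptica'
    "Art i Antiguitats",
    "\u00c0udio, V\u00eddeo, CD i DVD",  # 'Àudio, Vídeo, CD i DVD'
    "Bars i Restaurants",
    "Aficions i Joguines",
]

# Python compares str values code point by code point, with no locale rules.
by_code_point = sorted(names)

for name in by_code_point:
    print(name)
```

If the query returns this same order, the collation itself is the first thing to check, before looking at the data.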

NFC

I have one other idea: unnormalized UNICODE strings.

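Could it be that your 'Àudio' is in fact 'A' || a combining grave accent (U+0300) || 'udio', rather than starting with the single precomposed character 'À' (U+00C0)? That would be this character:

SELECT U&'A\0300';        -- 'A' followed by the combining grave accent
SELECT ascii(U&'\0300');  -- code point of the combining character
SELECT chr(768);          -- the combining character on its own

Read more about the combining grave accent on Wikipedia.
You have to SET standard_conforming_strings = TRUE to use Unicode escape strings like the one in the first line.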

Note that some browsers cannot display unnormalized Unicode characters correctly and many fonts have no proper glyph for the special characters, so you may see nothing here or gibberish. But UNICODE allows for that nonsense. Test to see what you got:

SELECT octet_length(U&'A\0300');  -- returns 3 (!): 'A' plus a 2-byte combining accent
SELECT octet_length('À');         -- returns 2 for the precomposed character
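
The same check can be made outside the database. The following is a minimal sketch in Python, not part of the original answer, that reports the UTF-8 length and the Unicode character names of a string; the helper name describe is made up for illustration.

```python
import unicodedata

decomposed = "A\u0300"   # 'A' followed by COMBINING GRAVE ACCENT
precomposed = "\u00c0"   # LATIN CAPITAL LETTER A WITH GRAVE


def describe(s):
    """Return (UTF-8 byte length, list of Unicode character names) for s."""
    return len(s.encode("utf-8")), [unicodedata.name(ch) for ch in s]


print(describe(decomposed))
print(describe(precomposed))
# The two forms render alike but are different strings.
print(decomposed == precomposed)
```

Running describe on a value exported from the name_language column should show whether it contains combining characters.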

If that is what your database has contracted, you need to get rid of it or suffer the consequences. The cure is to normalize your strings to NFC. Perl has superior UNICODE-foo skills, and you can make use of its libraries in a plperlu function to do the normalization inside PostgreSQL. I have done that to save myself from madness.
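
To show what NFC normalization does to such a string, here is a small sketch using Python's standard unicodedata module. It is not the plperlu approach described above, only an illustration that could be applied to data before it is loaded; the function name to_nfc is an assumption made for this example.

```python
import unicodedata


def to_nfc(s):
    """Return s in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", s)


decomposed = "A\u0300udio"        # 6 code points: 'A', combining grave, 'udio'
normalized = to_nfc(decomposed)   # the first two code points merge into U+00C0

print(len(decomposed), len(normalized))
print(normalized == "\u00c0udio")
```

Normalizing an already normalized string leaves it unchanged, so the function can be applied to a whole column's worth of values without first checking each one.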

Read installation instructions in this excellent article about UNICODE normalization in PostgreSQL by David Wheeler.
Read all the gory details about Unicode Normalization Forms at unicode.org.