PostgreSQL full-text search randomly dropping lexemes

(All this was done with PostgreSQL 9.0.1 on Ubuntu 10.04, FWIW.)

I've tried writing my own lexize function for PostgreSQL specifically for names, based on the double metaphone algorithm. To produce the C lexize function, I started with the dict_xsyn example, and basically swapped double metaphone in for the synonym lookup.
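
(As a sanity check on the C side: the stock fuzzystrmatch contrib module already exposes double metaphone at the SQL level, so you can see what codes the dictionary ought to emit. A quick sketch, assuming fuzzystrmatch is loaded:)

-- fuzzystrmatch's dmetaphone()/dmetaphone_alt() return the primary and
-- alternate double metaphone codes for a string
select dmetaphone('SMITH') as code, dmetaphone_alt('SMITH') as alt_code;
-- expected: code = 'SM0', alt_code = 'XMT'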

But maybe 20% of the time to_tsvector is invoked, it seems to be dropping lexemes. As an example, I created a table containing the US Census Bureau's list of names.
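
(For reference, the table was created along these lines; the file path and load options here are hypothetical, only the column layout is taken from the output below:)

-- hypothetical recreation of the names table; the path is a placeholder
create table names (
    name    text,
    freq    real,
    cumfreq real,
    rank    integer
);
copy names from '/tmp/dist.all.last';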

db=# select * from names order by rank limit 8;
   name   | freq  | cumfreq | rank 
----------+-------+---------+------
 SMITH    | 1.006 |   1.006 |    1
 JOHNSON  |  0.81 |   1.816 |    2
 WILLIAMS | 0.699 |   2.515 |    3
 JONES    | 0.621 |   3.136 |    4
 BROWN    | 0.621 |   3.757 |    5
 DAVIS    |  0.48 |   4.237 |    6
 MILLER   | 0.424 |    4.66 |    7
 WILSON   | 0.339 |       5 |    8
(8 rows)

We can then add a vector column, and populate it with to_tsvector using my metaphone configuration:

db=# alter table names add column vec tsvector;
ALTER TABLE
db=# update names set vec=to_tsvector('public.names', name);
UPDATE 88799
db=# select * from names order by rank limit 8;
   name   | freq  | cumfreq | rank |              vec              
----------+-------+---------+------+-------------------------------
 SMITH    | 1.006 |   1.006 |    1 | 
 JOHNSON  |  0.81 |   1.816 |    2 | 'ANSN':1 'JNSN':1 'johnson':1
 WILLIAMS | 0.699 |   2.515 |    3 | 
 JONES    | 0.621 |   3.136 |    4 | 
 BROWN    | 0.621 |   3.757 |    5 | 
 DAVIS    |  0.48 |   4.237 |    6 | 
 MILLER   | 0.424 |    4.66 |    7 | 'MLR':1 'miller':1
 WILSON   | 0.339 |       5 |    8 | 'ALSN':1 'FLSN':1 'wilson':1
(8 rows)

A bunch of the vec fields are just empty! In fact:

db=# select count(1) from names where vec = to_tsvector('');
 count 
-------
 41101
(1 row)

I can run that bulk update and then recount the mismatched fields over and over, and I get a different count each time, because a different set of rows fails to get a real vector computed.
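
(One way to see the nondeterminism directly: recompute the vector for exactly the rows whose stored value came out empty, and count how many now come back non-empty. A sketch:)

-- rows whose stored vec is empty but which produce a non-empty tsvector
-- when recomputed; a nonzero count means the failures don't repeat
select count(1) from names
where vec = to_tsvector('')
  and to_tsvector('public.names', name) <> to_tsvector('');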

But I think my lexize function itself is ok?

db=# alter table names drop column vec; alter table names add column lexemes varchar[];
ALTER TABLE
ALTER TABLE
db=# update names set lexemes=ts_lexize('dmetaphonedict', name);
UPDATE 88799
db=# select * from names order by rank limit 10;
   name   | freq  | cumfreq | rank |       lexemes        
----------+-------+---------+------+----------------------
 SMITH    | 1.006 |   1.006 |    1 | {smith,SM0,XMT}
 JOHNSON  |  0.81 |   1.816 |    2 | {johnson,JNSN,ANSN}
 WILLIAMS | 0.699 |   2.515 |    3 | {williams,ALMS,FLMS}
 JONES    | 0.621 |   3.136 |    4 | {jones,JNS,ANS}
 BROWN    | 0.621 |   3.757 |    5 | {brown,PRN}
 DAVIS    |  0.48 |   4.237 |    6 | {davis,TFS}
 MILLER   | 0.424 |    4.66 |    7 | {miller,MLR}
 WILSON   | 0.339 |       5 |    8 | {wilson,ALSN,FLSN}
 MOORE    | 0.312 |   5.312 |    9 | {moore,MR}
 TAYLOR   | 0.311 |   5.623 |   10 | {taylor,TLR}
(10 rows)
db=# select count(1) from names where array_length(lexemes,1)=0;
 count 
-------
     0
(1 row)

I can do that over and over and get 0 mismatches each time.
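
(The same style of re-check at the ts_lexize level stays stable; a sketch:)

-- compare the stored arrays against a fresh ts_lexize() run;
-- this stays at 0, so the dictionary itself looks deterministic
select count(1) from names
where lexemes is distinct from ts_lexize('dmetaphonedict', name);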

I'm creating the relevant text search dictionaries and configurations with:

CREATE OR REPLACE FUNCTION ddmetaphonedict_init(internal)
        RETURNS internal
        AS '$libdir/dict_dmetaphone'
        LANGUAGE C STRICT;

CREATE OR REPLACE FUNCTION ddmetaphonedict_lexize(internal, internal, internal, internal)
        RETURNS internal
        AS '$libdir/dict_dmetaphone'
        LANGUAGE C STRICT;

CREATE TEXT SEARCH TEMPLATE dmetaphonedict_template (
        LEXIZE = ddmetaphonedict_lexize,
        INIT   = ddmetaphonedict_init
);

CREATE TEXT SEARCH DICTIONARY dmetaphonedict (
        TEMPLATE = dmetaphonedict_template
);

COMMENT ON TEXT SEARCH DICTIONARY dmetaphonedict IS 'dictionary for names, using dmetaphone';

create text search configuration names (copy=english);
alter text search configuration names alter mapping for asciiword, asciihword, word, hword with dmetaphonedict;

Pretty much straight out of the dict_xsyn example.

What's going on? What can I do to figure out where these things are being dropped?
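
(One built-in place to start is ts_debug, which traces each token through the configuration's dictionary list and shows what lexemes came back; a dropped name should show up with an empty lexemes column. A sketch:)

-- ts_debug walks the parser output through the 'names' configuration
select alias, token, dictionary, lexemes
from ts_debug('public.names', 'SMITH JOHNSON WILSON');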

ADDED: Just tried this with 9.1. The rate at which to_tsvector produces a dud tsvector is greatly reduced: anywhere from 5 (just 5, not 5k) to 7000 records out of 80000+, depending on the run. But they are still there.

Comments (1)

旧情勿念 2024-10-06 04:43:07

I'd like to first eliminate to_tsvector() as a cause.

I might first try changing to_tsvector() so it returns a constant. For example, change it to return "'ANSN':1 'JNSN':1 'johnson':1" no matter what its input arguments are. (Even if input arguments are NULL.) If your bulk update is still leaving holes in the column "vec", I think that would clear to_tsvector().

Am I right about that?
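
(A minimal sketch of that experiment, assuming you test through a wrapper rather than patching the built-in; const_tsvector is a made-up name:)

-- hypothetical stand-in that ignores its arguments entirely; if the bulk
-- update still leaves empty rows afterwards, to_tsvector is in the clear
create function const_tsvector(regconfig, text) returns tsvector as $f$
    select $$'ANSN':1 'JNSN':1 'johnson':1$$::tsvector;
$f$ language sql immutable;

update names set vec = const_tsvector('public.names', name);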

Also consider a two-step process, using lexize() to populate the column "lexemes", then another function to read the lexemes and produce the values in the column "vec".
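
(Step two could be as simple as casting the joined array back to a tsvector, since the plain ::tsvector cast does no further normalization; a sketch, assuming the vec column has been re-added:)

-- build the tsvector directly from the ts_lexize() output, bypassing the
-- parser and dictionary pipeline (word positions are not preserved)
update names set vec = array_to_string(lexemes, ' ')::tsvector;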
