PostgreSQL full-text search randomly dropping lexemes

(All this was done with PostgreSQL 9.0.1 on Ubuntu 10.04, FWIW.)

I've tried writing my own lexize function for PostgreSQL specifically for names, based on the double metaphone algorithm. To produce the C lexize function, I started with the dict_xsyn example, and basically swapped double metaphone in for the synonym lookup.
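
(As a sanity check on the C side: the stock fuzzystrmatch contrib module already exposes double metaphone at the SQL level, so you can see what codes the dictionary ought to emit. A quick sketch, assuming fuzzystrmatch is loaded:)

-- fuzzystrmatch's dmetaphone()/dmetaphone_alt() return the primary and
-- alternate double metaphone codes for a string
select dmetaphone('SMITH') as code, dmetaphone_alt('SMITH') as alt_code;
-- expected: code = 'SM0', alt_code = 'XMT'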

But maybe 20% of the time to_tsvector is invoked, it seems to be dropping lexemes. As an example, I created a table containing the US Census Bureau's list of names.
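
(For reference, the table was created along these lines; the file path and load options here are hypothetical, only the column layout is taken from the output below:)

-- hypothetical recreation of the names table; the path is a placeholder
create table names (
    name    text,
    freq    real,
    cumfreq real,
    rank    integer
);
copy names from '/tmp/dist.all.last';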

db=# select * from names order by rank limit 8;
   name   | freq  | cumfreq | rank 
----------+-------+---------+------
 SMITH    | 1.006 |   1.006 |    1
 JOHNSON  |  0.81 |   1.816 |    2
 WILLIAMS | 0.699 |   2.515 |    3
 JONES    | 0.621 |   3.136 |    4
 BROWN    | 0.621 |   3.757 |    5
 DAVIS    |  0.48 |   4.237 |    6
 MILLER   | 0.424 |    4.66 |    7
 WILSON   | 0.339 |       5 |    8
(8 rows)

We can then add a vector column, and populate it with to_tsvector using my metaphone configuration:

db=# alter table names add column vec tsvector;
ALTER TABLE
db=# update names set vec=to_tsvector('public.names', name);
UPDATE 88799
db=# select * from names order by rank limit 8;
   name   | freq  | cumfreq | rank |              vec              
----------+-------+---------+------+-------------------------------
 SMITH    | 1.006 |   1.006 |    1 | 
 JOHNSON  |  0.81 |   1.816 |    2 | 'ANSN':1 'JNSN':1 'johnson':1
 WILLIAMS | 0.699 |   2.515 |    3 | 
 JONES    | 0.621 |   3.136 |    4 | 
 BROWN    | 0.621 |   3.757 |    5 | 
 DAVIS    |  0.48 |   4.237 |    6 | 
 MILLER   | 0.424 |    4.66 |    7 | 'MLR':1 'miller':1
 WILSON   | 0.339 |       5 |    8 | 'ALSN':1 'FLSN':1 'wilson':1
(8 rows)

A bunch of the vec fields are just empty! In fact:

db=# select count(1) from names where vec = to_tsvector('');
 count 
-------
 41101
(1 row)

I can run that bulk update and then recount the mismatched fields over and over, and I get a different count each time, because a different set of rows fails to get a real vector computed.
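
(One way to see the nondeterminism directly: recompute the vector for exactly the rows whose stored value came out empty, and count how many now come back non-empty. A sketch:)

-- rows whose stored vec is empty but which produce a non-empty tsvector
-- when recomputed; a nonzero count means the failures don't repeat
select count(1) from names
where vec = to_tsvector('')
  and to_tsvector('public.names', name) <> to_tsvector('');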

But I think my lexize function itself is ok?

db=# alter table names drop column vec; alter table names add column lexemes varchar[];
ALTER TABLE
ALTER TABLE
db=# update names set lexemes=ts_lexize('dmetaphonedict', name);
UPDATE 88799
db=# select * from names order by rank limit 10;
   name   | freq  | cumfreq | rank |       lexemes        
----------+-------+---------+------+----------------------
 SMITH    | 1.006 |   1.006 |    1 | {smith,SM0,XMT}
 JOHNSON  |  0.81 |   1.816 |    2 | {johnson,JNSN,ANSN}
 WILLIAMS | 0.699 |   2.515 |    3 | {williams,ALMS,FLMS}
 JONES    | 0.621 |   3.136 |    4 | {jones,JNS,ANS}
 BROWN    | 0.621 |   3.757 |    5 | {brown,PRN}
 DAVIS    |  0.48 |   4.237 |    6 | {davis,TFS}
 MILLER   | 0.424 |    4.66 |    7 | {miller,MLR}
 WILSON   | 0.339 |       5 |    8 | {wilson,ALSN,FLSN}
 MOORE    | 0.312 |   5.312 |    9 | {moore,MR}
 TAYLOR   | 0.311 |   5.623 |   10 | {taylor,TLR}
(10 rows)
db=# select count(1) from names where array_length(lexemes,1)=0;
 count 
-------
     0
(1 row)

I can do that over and over and get 0 mismatches each time.
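
(The same style of re-check at the ts_lexize level stays stable; a sketch:)

-- compare the stored arrays against a fresh ts_lexize() run;
-- this stays at 0, so the dictionary itself looks deterministic
select count(1) from names
where lexemes is distinct from ts_lexize('dmetaphonedict', name);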

I'm creating the relevant text search dictionaries and configurations with:

CREATE OR REPLACE FUNCTION ddmetaphonedict_init(internal)
        RETURNS internal
        AS '$libdir/dict_dmetaphone'
        LANGUAGE C STRICT;

CREATE OR REPLACE FUNCTION ddmetaphonedict_lexize(internal, internal, internal, internal)
        RETURNS internal
        AS '$libdir/dict_dmetaphone'
        LANGUAGE C STRICT;

CREATE TEXT SEARCH TEMPLATE dmetaphonedict_template (
        LEXIZE = ddmetaphonedict_lexize,
        INIT   = ddmetaphonedict_init
);

CREATE TEXT SEARCH DICTIONARY dmetaphonedict (
        TEMPLATE = dmetaphonedict_template
);

COMMENT ON TEXT SEARCH DICTIONARY dmetaphonedict IS 'dictionary for names, using dmetaphone';

create text search configuration names (copy=english);
alter text search configuration names alter mapping for asciiword, asciihword, word, hword with dmetaphonedict;

Pretty much straight out of the dict_xsyn example.

What's going on? What can I do to figure out where these things are being dropped?
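
(One built-in place to start is ts_debug, which traces each token through the configuration's dictionary list and shows what lexemes came back; a dropped name should show up with an empty lexemes column. A sketch:)

-- ts_debug walks the parser output through the 'names' configuration
select alias, token, dictionary, lexemes
from ts_debug('public.names', 'SMITH JOHNSON WILSON');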

ADDED: Just tried this with 9.1. The rate at which to_tsvector produces a dud tsvector is greatly reduced: anywhere from 5 (just 5, not 5k) to 7000 records out of 80000+, depending on the run. But they are still there.

Comments (1)

旧情勿念 2024-10-06 04:43:07

I'd like to first eliminate to_tsvector() as a cause.

I might first try changing to_tsvector() so it returns a constant. For example, change it to return "'ANSN':1 'JNSN':1 'johnson':1" no matter what its input arguments are. (Even if input arguments are NULL.) If your bulk update is still leaving holes in the column "vec", I think that would clear to_tsvector().

Am I right about that?
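
(A minimal sketch of that experiment, assuming you test through a wrapper rather than patching the built-in; const_tsvector is a made-up name:)

-- hypothetical stand-in that ignores its arguments entirely; if the bulk
-- update still leaves empty rows afterwards, to_tsvector is in the clear
create function const_tsvector(regconfig, text) returns tsvector as $f$
    select $$'ANSN':1 'JNSN':1 'johnson':1$$::tsvector;
$f$ language sql immutable;

update names set vec = const_tsvector('public.names', name);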

Also consider a two-step process, using lexize() to populate the column "lexemes", then another function to read the lexemes and produce the values in the column "vec".
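
(Step two could be as simple as casting the joined array back to a tsvector, since the plain ::tsvector cast does no further normalization; a sketch, assuming the vec column has been re-added:)

-- build the tsvector directly from the ts_lexize() output, bypassing the
-- parser and dictionary pipeline (word positions are not preserved)
update names set vec = array_to_string(lexemes, ' ')::tsvector;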
