查询数据库中名称的建议

发布于 2024-12-07 17:23:30 字数 1332 浏览 1 评论 0原文

我有一个 Oracle 数据库，与许多数据库一样，它有一个包含个人信息的表。对此，我想以“自然”的方式按名称进行搜索。

该表有 forename 和 surname 字段，目前，我正在使用类似这样的内容：

select id, forename, surname
from   mytable
where  upper(forename) like '%JOHN%'
and    upper(surname) like '%SMITH%';

这可行，但可能会很慢，因为该表上的索引显然可以' t 考虑前面的通配符。此外，用户通常会根据他们通过电话告诉他们的内容来搜索人员（包括大量非英语名字），因此最好也进行一些语音分析。

因此，我一直在尝试使用 Oracle Text：

create index forenameFTX on mytable(forename) indextype is ctxsys.context;
create index surnameFTX on mytable(surname) indextype is ctxsys.context;

select   score(1)+score(2) relevance,
         id,
         forename,
         surname
from     mytable
where    contains(forename,'!%john%',1) > 0
and      contains(surname,'!%smith%',2) > 0
order by relevance desc;

这具有使用 Soundex 算法和全文索引的优点，因此它应该会更高效一些。（尽管，我的轶事结果表明它相当慢！）我对此唯一的担忧是：

首先，文本索引需要以某种有意义的方式刷新。使用on commit会太慢，并且可能会干扰前端软件（这是我无法控制的）与数据库的交互方式；因此需要一些思考...
Oracle 返回的结果并不是完全自然排序的；我不太确定这个 score 函数。例如，我的开发数据显示“Jonathan Peter Jason Smith”位于顶部（很好），但也显示“Jane Margaret Simpson”与“John Terrance Smith”处于同一级别

我认为删除前面的通配符可能会有所改善性能而不降低结果，因为在现实生活中，您永远不会搜索名称中间的块。然而，除此之外，我对想法持开放态度......这种情况一定是被实施得令人作呕！谁能对我现在正在做/考虑的事情提出更好的方法？

谢谢：）

原文

I have an Oracle database that, like many, has a table containing biographical information. On which, I would like to search by name in a "natural" way.

The table has forename and surname fields and, currently, I am using something like this:

select id, forename, surname
from   mytable
where  upper(forename) like '%JOHN%'
and    upper(surname) like '%SMITH%';

This works, but it can be very slow because the indices on this table obviously can't account for the preceding wildcard. Also, users will usually be searching for people based on what they tell them over the phone -- including a huge number of non-English names -- so it would be nice to also do some phonetic analysis.

As such, I have been experimenting with Oracle Text:

create index forenameFTX on mytable(forename) indextype is ctxsys.context;
create index surnameFTX on mytable(surname) indextype is ctxsys.context;

select   score(1)+score(2) relevance,
         id,
         forename,
         surname
from     mytable
where    contains(forename,'!%john%',1) > 0
and      contains(surname,'!%smith%',2) > 0
order by relevance desc;

This has the advantage of using the Soundex algorithm as well as full text indices, so it should be a little more efficient. (Although, my anecdotal results show it to be pretty slow!) The only apprehensions I have about this are:

Firstly, the text indices need to be refreshed in some meaningful way. Using on commit would be too slow and might interfere with how the frontend software -- which is out of my control -- interacts with the database; so requires some thinking about...
The results that are returned by Oracle aren't exactly very naturally sorted; I'm not really sure about this score function. For example, my development data is showing "Jonathan Peter Jason Smith" at the top -- fine -- but also "Jane Margaret Simpson" at the same level as "John Terrance Smith"

I'm thinking that removing the preceding wildcard might improve performance without degrading the results as, in real life, you would never search for a chunk in the middle of a name. However, otherwise, I'm open to ideas... This scenario must have been implemented ad nauseam! Can anyone suggest a better approach to what I'm doing/considering now?

Thanks :)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

糖果控 2024-12-14 17:23:30

按照评论中的建议，我提出了一个效果很好的解决方案。特别是，@X-Zero 关于创建 Soundexes 表的建议：就我而言，我可以创建新表，但不允许更改现有架构！

因此，我的流程如下：

创建一个新表，其中包含以下列：ID、token、sound 和 position< /代码>;主键位于 (ID, sound,position) 上，附加索引位于 (ID,sound)。
浏览简历表中的每个人：
- 连接他们的名字和姓氏。
- 将代码页更改为 us7ascii，以便重音字符标准化。这是因为 Soundex 算法不适用于重音字符。
- 将所有非字母字符转换为空格，并将其视为标记之间的边界。
- 对该字符串进行标记，并将标记（小写）、标记的 Soundex 以及标记在原始字符串中的位置插入表中；将此与 ID 相关联。

像这样：

declare
  nameString varchar2(82);
  token varchar2(40);
  posn integer;
  cursor myNames is
    select id,
           forename||' '||surname person_name
    from   mypeople;
begin
  for person in myNames
  loop
    nameString := trim(
                    utl_i18n.escape_reference(
                      regexp_replace(
                        regexp_replace(person.person_name,'[^[:alpha:]]',' '),
                        '\s+',' '),
                      'us7ascii')
                    )||' ';
    posn := 1;
    while nameString is not null
    loop
      token := substr(nameString,1,instr(nameString,' ') - 1);
      insert into personsearch values (person.id,lower(token),soundex(token),posn);
      nameString := substr(nameString,instr(nameString,' ') + 1);
      posn := posn + 1;
    end loop;
  end loop;
end;
/

例如，“Siân O'Conner”被标记为“sian”（位置 1）、“o”（位置 2）和“conner”（位置 3），这三个条目及其 Soundex 得到连同 ID 一起插入到 personsearch 中。

为了进行搜索，我们执行相同的过程：对搜索条件进行标记，然后返回 Soundexe 和相对位置匹配的结果。我们依次按位置和与原始搜索的编辑距离 (ld) 对每个标记进行排序。

例如，此查询将搜索两个标记（即预先标记化的搜索字符串）：

with     searchcriteria as (
         select 'john'  token1,
                'smith' token2
         from   dual)
select   alpha.id,
         mypeople.forename||' '||mypeople.surname
from     peoplesearch alpha
join     mypeople
on       mypeople.student_id = alpha.student_id
join     peoplesearch beta
on       beta.student_id = alpha.student_id
and      beta.position   > alpha.position
join     searchcriteria
on       1 = 1
where    alpha.sound = soundex(searchcriteria.token1)
and      beta.sound  = soundex(searchcriteria.token2)
order by alpha.position,
         ld(alpha.token,searchcriteria.token1),
         beta.position,
         ld(beta.token,searchcriteria.token2),
         alpha.student_id;

要搜索任意数量的标记，我们需要使用动态 SQL：加入搜索表的次数与标记的数量一样多，其中连接表中的 position 字段必须大于先前连接表的 position...我计划编写一个函数来执行此操作 - 以及搜索字符串标记化——这将返回一个 ID 表。不过，我只是将其发布在这里，以便您了解这个想法:)

正如我所说，这非常有效：它很快就会返回良好的结果。即使搜索“John Smith”，一旦被服务器缓存，运行时间也不到 0.2 秒；返回超过 200 行...我对此非常满意，并且希望将其投入生产。唯一的问题是：

令牌的预先计算需要一段时间，但这是一个一次性过程，因此不会有太大问题。然而，一个相关的问题是，每当对 mypeople 执行相应操作时，都需要在 mypeople 表上放置一个触发器，以便将标记插入/更新/删除到搜索表中。这可能会减慢系统速度；但由于这种情况只会在一年中的几个时期发生，也许更好的解决方案是定期重建搜索表。
没有进行词干提取，因此 Soundex 算法仅匹配完整的标记。例如，搜索“chris”将不会返回任何“christopher”。一个可能的解决方案是只存储 token 词干的 Soundex，但是计算词干并不是一个简单的问题！这将是未来的升级，可能会使用 TeX 使用的连字符引擎...

无论如何，希望有所帮助:) 欢迎评论！

编辑我的完整解决方案（编写和实施）现在是此处，使用 Metaphone 和 Damerau-Levenshtein 距离。

I have come up with a solution which works pretty well, following the suggestions in the comments. Particularly, @X-Zero's suggestion of creating a table of Soundexes: In my case, I can create new tables, but altering the existing schema is not allowed!

So, my process is as follows:

Create a new table with columns: ID, token, sound and position; with the primary key over (ID, sound,position) and an additional index over (ID,sound).
Go through each person in the biographical table:
- Concatenate their forename and surname.
- Change the codepage to us7ascii, so accented characters are normalised. This is because the Soundex algorithm doesn't work with accented characters.
- Convert all non-alphabetic characters into whitespace and consider this the boundary between tokens.
- Tokenise this string and insert into the table the token (in lowercase), the Soundex of the token and the position the token comes in the original string; associate this with ID.

Like so:

declare
  nameString varchar2(82);
  token varchar2(40);
  posn integer;
  cursor myNames is
    select id,
           forename||' '||surname person_name
    from   mypeople;
begin
  for person in myNames
  loop
    nameString := trim(
                    utl_i18n.escape_reference(
                      regexp_replace(
                        regexp_replace(person.person_name,'[^[:alpha:]]',' '),
                        '\s+',' '),
                      'us7ascii')
                    )||' ';
    posn := 1;
    while nameString is not null
    loop
      token := substr(nameString,1,instr(nameString,' ') - 1);
      insert into personsearch values (person.id,lower(token),soundex(token),posn);
      nameString := substr(nameString,instr(nameString,' ') + 1);
      posn := posn + 1;
    end loop;
  end loop;
end;
/

So, for example, "Siân O'Conner" gets tokenised into "sian" (position 1), "o" (position 2) and "conner" (position 3) and those three entries, with their Soundex, get inserted into personsearch along with their ID.

To search, we do the same process: tokenise the search criteria and then return results where the Soundexes and relative positions match. We order by the position and then the Levenshtein distance (ld) from the original search for each token, in turn.

This query, for example, will search against two tokens (i.e., pre-tokenised search string):

with     searchcriteria as (
         select 'john'  token1,
                'smith' token2
         from   dual)
select   alpha.id,
         mypeople.forename||' '||mypeople.surname
from     peoplesearch alpha
join     mypeople
on       mypeople.student_id = alpha.student_id
join     peoplesearch beta
on       beta.student_id = alpha.student_id
and      beta.position   > alpha.position
join     searchcriteria
on       1 = 1
where    alpha.sound = soundex(searchcriteria.token1)
and      beta.sound  = soundex(searchcriteria.token2)
order by alpha.position,
         ld(alpha.token,searchcriteria.token1),
         beta.position,
         ld(beta.token,searchcriteria.token2),
         alpha.student_id;

To search against an arbitrary number of tokens, we would need to use dynamic SQL: joining the search table as many times as there are tokens, where the position field in the joined table must be greater than the position of the previously joined table... I plan to write a function to do this -- as well as the search string tokenisation -- which will return a table of IDs. However, I just post this here so you get the idea :)

As I say, this works pretty well: It returns good results pretty quickly. Even searching for "John Smith", once cached by the server, runs in less than 0.2s; returning over 200 rows... I'm pretty pleased with it and will be looking to put it into production. The only issues are:

The precalculation of tokens takes a while, but it's a one-off process, so not too much of a problem. A related problem however is that a trigger needs to be put on the mypeople table to insert/update/delete tokens into the search table whenever the corresponding operation is performed on mypeople. This may slow up the system; but as this should only happen during a few periods in a year, perhaps a better solution would be to rebuild the search table on a scheduled basis.
No stemming is being done, so the Soundex algorithm only matches on full tokens. For example, a search for "chris" will not return any "christopher"s. A possible solution to this is to only store the Soundex of the stem of the token, but calculating the stem is not a simple problem! This will be a future upgrade, possibly using the hyphenation engine used by TeX...

Anyway, hope that helps :) Comments welcome!

EDIT My full solution (write up and implementation) is now here, using Metaphone and the Damerau-Levenshtein Distance.

回复收藏 0 原文

~没有更多了~