查询数据库中名称的建议
我有一个 Oracle 数据库,与许多数据库一样,它有一个包含个人信息的表。对此,我想以“自然”的方式按名称进行搜索。
该表有 forename
和 surname
字段,目前,我正在使用类似这样的内容:
select id, forename, surname
from mytable
where upper(forename) like '%JOHN%'
and upper(surname) like '%SMITH%';
这可行,但可能会很慢,因为该表上的索引显然可以' t 考虑前面的通配符。此外,用户通常会根据他们通过电话告诉他们的内容来搜索人员(包括大量非英语名字),因此最好也进行一些语音分析。
因此,我一直在尝试使用 Oracle Text:
create index forenameFTX on mytable(forename) indextype is ctxsys.context;
create index surnameFTX on mytable(surname) indextype is ctxsys.context;
select score(1)+score(2) relevance,
id,
forename,
surname
from mytable
where contains(forename,'!%john%',1) > 0
and contains(surname,'!%smith%',2) > 0
order by relevance desc;
这具有使用 Soundex 算法和全文索引的优点,因此它应该会更高效一些。 (尽管,我的轶事结果表明它相当慢!)我对此唯一的担忧是:
首先,文本索引需要以某种有意义的方式刷新。使用
on commit
会太慢,并且可能会干扰前端软件(这是我无法控制的)与数据库的交互方式;因此需要一些思考...Oracle 返回的结果并不是完全自然排序的;我不太确定这个
score
函数。例如,我的开发数据显示“Jonathan Peter Jason Smith”位于顶部(很好),但也显示“Jane Margaret Simpson”与“John Terrance Smith”处于同一级别
我认为删除前面的通配符可能会有所改善性能而不降低结果,因为在现实生活中,您永远不会搜索名称中间的块。然而,除此之外,我对想法持开放态度......这种情况一定是被实施得令人作呕!谁能对我现在正在做/考虑的事情提出更好的方法?
谢谢 :)
I have an Oracle database that, like many, has a table containing biographical information. On which, I would like to search by name in a "natural" way.
The table has forename
and surname
fields and, currently, I am using something like this:
select id, forename, surname
from mytable
where upper(forename) like '%JOHN%'
and upper(surname) like '%SMITH%';
This works, but it can be very slow because the indices on this table obviously can't account for the preceding wildcard. Also, users will usually be searching for people based on what they tell them over the phone -- including a huge number of non-English names -- so it would be nice to also do some phonetic analysis.
As such, I have been experimenting with Oracle Text:
create index forenameFTX on mytable(forename) indextype is ctxsys.context;
create index surnameFTX on mytable(surname) indextype is ctxsys.context;
select score(1)+score(2) relevance,
id,
forename,
surname
from mytable
where contains(forename,'!%john%',1) > 0
and contains(surname,'!%smith%',2) > 0
order by relevance desc;
This has the advantage of using the Soundex algorithm as well as full text indices, so it should be a little more efficient. (Although, my anecdotal results show it to be pretty slow!) The only apprehensions I have about this are:
Firstly, the text indices need to be refreshed in some meaningful way. Using
on commit
would be too slow and might interfere with how the frontend software -- which is out of my control -- interacts with the database; so requires some thinking about...The results that are returned by Oracle aren't exactly very naturally sorted; I'm not really sure about this
score
function. For example, my development data is showing "Jonathan Peter Jason Smith" at the top -- fine -- but also "Jane Margaret Simpson" at the same level as "John Terrance Smith"
I'm thinking that removing the preceding wildcard might improve performance without degrading the results as, in real life, you would never search for a chunk in the middle of a name. However, otherwise, I'm open to ideas... This scenario must have been implemented ad nauseam! Can anyone suggest a better approach to what I'm doing/considering now?
Thanks :)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
按照评论中的建议,我提出了一个效果很好的解决方案。特别是,@X-Zero 关于创建 Soundexes 表的建议:就我而言,我可以创建新表,但不允许更改现有架构!
因此,我的流程如下:
创建一个新表,其中包含以下列:
ID
、token
、sound
和position< /代码>;主键位于 (
ID
,sound
,position
) 上,附加索引位于 (ID
,sound
)。浏览简历表中的每个人:
连接他们的名字和姓氏。
将代码页更改为
us7ascii
,以便重音字符标准化。这是因为 Soundex 算法不适用于重音字符。将所有非字母字符转换为空格,并将其视为标记之间的边界。
对该字符串进行标记,并将标记(小写)、标记的 Soundex 以及标记在原始字符串中的位置插入表中;将此与
ID
相关联。像这样:
例如,“Siân O'Conner”被标记为“sian”(位置 1)、“o”(位置 2)和“conner”(位置 3),这三个条目及其 Soundex 得到连同 ID 一起插入到
personsearch
中。ld
) 对每个标记进行排序。例如,此查询将搜索两个标记(即预先标记化的搜索字符串):
要搜索任意数量的标记,我们需要使用动态 SQL:加入搜索表的次数与标记的数量一样多,其中连接表中的
position
字段必须大于先前连接表的position
...我计划编写一个函数来执行此操作 - 以及搜索字符串标记化——这将返回一个 ID 表。不过,我只是将其发布在这里,以便您了解这个想法:)正如我所说,这非常有效:它很快就会返回良好的结果。即使搜索“John Smith”,一旦被服务器缓存,运行时间也不到 0.2 秒;返回超过 200 行...我对此非常满意,并且希望将其投入生产。唯一的问题是:
令牌的预先计算需要一段时间,但这是一个一次性过程,因此不会有太大问题。然而,一个相关的问题是,每当对
mypeople
执行相应操作时,都需要在mypeople
表上放置一个触发器,以便将标记插入/更新/删除到搜索表中。这可能会减慢系统速度;但由于这种情况只会在一年中的几个时期发生,也许更好的解决方案是定期重建搜索表。没有进行词干提取,因此 Soundex 算法仅匹配完整的标记。例如,搜索“chris”将不会返回任何“christopher”。一个可能的解决方案是只存储 token 词干的 Soundex,但是计算词干并不是一个简单的问题!这将是未来的升级,可能会使用 TeX 使用的连字符引擎...
无论如何,希望有所帮助:) 欢迎评论!
编辑我的完整解决方案(编写和实施)现在是此处,使用 Metaphone 和 Damerau-Levenshtein 距离。
I have come up with a solution which works pretty well, following the suggestions in the comments. Particularly, @X-Zero's suggestion of creating a table of Soundexes: In my case, I can create new tables, but altering the existing schema is not allowed!
So, my process is as follows:
Create a new table with columns:
ID
,token
,sound
andposition
; with the primary key over (ID
,sound
,position
) and an additional index over (ID
,sound
).Go through each person in the biographical table:
Concatenate their forename and surname.
Change the codepage to
us7ascii
, so accented characters are normalised. This is because the Soundex algorithm doesn't work with accented characters.Convert all non-alphabetic characters into whitespace and consider this the boundary between tokens.
Tokenise this string and insert into the table the token (in lowercase), the Soundex of the token and the position the token comes in the original string; associate this with
ID
.Like so:
So, for example, "Siân O'Conner" gets tokenised into "sian" (position 1), "o" (position 2) and "conner" (position 3) and those three entries, with their Soundex, get inserted into
personsearch
along with their ID.ld
) from the original search for each token, in turn.This query, for example, will search against two tokens (i.e., pre-tokenised search string):
To search against an arbitrary number of tokens, we would need to use dynamic SQL: joining the search table as many times as there are tokens, where the
position
field in the joined table must be greater than theposition
of the previously joined table... I plan to write a function to do this -- as well as the search string tokenisation -- which will return a table of IDs. However, I just post this here so you get the idea :)As I say, this works pretty well: It returns good results pretty quickly. Even searching for "John Smith", once cached by the server, runs in less than 0.2s; returning over 200 rows... I'm pretty pleased with it and will be looking to put it into production. The only issues are:
The precalculation of tokens takes a while, but it's a one-off process, so not too much of a problem. A related problem however is that a trigger needs to be put on the
mypeople
table to insert/update/delete tokens into the search table whenever the corresponding operation is performed onmypeople
. This may slow up the system; but as this should only happen during a few periods in a year, perhaps a better solution would be to rebuild the search table on a scheduled basis.No stemming is being done, so the Soundex algorithm only matches on full tokens. For example, a search for "chris" will not return any "christopher"s. A possible solution to this is to only store the Soundex of the stem of the token, but calculating the stem is not a simple problem! This will be a future upgrade, possibly using the hyphenation engine used by TeX...
Anyway, hope that helps :) Comments welcome!
EDIT My full solution (write up and implementation) is now here, using Metaphone and the Damerau-Levenshtein Distance.