q-gram近似匹配优化

发布于 2024-08-15 12:28:32 字数 243 浏览 7 评论 0原文

我有一个包含 300 万人记录的表,我想使用 q-gram 对其执行模糊匹配(例如姓氏)。我创建了一个链接到此的 2-grams 表,但在此数据量上的搜索性能并不好(大约 5 分钟)。

我基本上有两个问题: (1) 您能否建议任何方法来提高性能以避免表扫描(即必须计算搜索字符串和 300 万个姓氏之间的常见 q-gram) (2) 对于q-grams,如果A与B相似并且C与B相似,是否意味着C与A相似?

亲切的问候

彼得

I have a table containing 3 million people records on which I want to perform fuzzy matching using q-grams (on surname for instance). I have created a table of 2-grams linking to this, but search performance is not great on this data volume (around 5 minutes).

I basically have two questions:
(1) Can you suggest any ways to improve performance to avoid a table scan (i.e. having to count common q-grams between the search string and 3 million surnames)
(2) With q-grams, if A is similar to B and C is similar to B, does it imply C is similar to A?

Kind regards

Peter

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

我的奇迹 2024-08-22 12:28:32

您肯定已经看到到处都是模糊文本搜索。例如,您输入“stck”,但实际上您的意思是“stack”!有没有想过这个东西是如何工作的?

有很多算法可以进行模糊文本匹配,每种算法都有自己的优点和缺点。最著名的是编辑距离和 qgram。我今天想重点关注 qgrams 并实现一个示例。

基本上,qgrams 是最适合关系数据库的模糊字符串匹配算法。这很简单。 qgram 中的“q”将被替换为 2-gram 或 3-gram 甚至 4-gram 等数字。

2-gram 意味着每个单词都被分成一组两个字符的gram。 “Stack”将被分解为一组{“st”,“ta”,“ac”,“ck”}或“database”将被分解为{“da”,“at”,“ta”,“ab ”、“ba”、“as”、“se”}。

一旦单词被分解为 2-gram,我们就可以在数据库中搜索一组值而不是一个字符串。例如,如果用户错误输入“stck”,则任何对“stck”的搜索都不会匹配“stack”,因为缺少“a”,但 2 元组 {“st”,“tc”,“ck”} 有 2 行与 2 克堆栈组相同!宾果游戏,我们发现了一场非常接近的比赛。它与 2 克数据库集没有任何共同点,并且与 2 克“stat”集只有 1 个共同点,因此我们可以轻松地建议用户他打算输入:第一个“stack”或第二个“star” ”。

现在让我们使用 Sql Server 来实现它:假设一个假设的单词数据集。 2grams 和单词之间需要有多对多的关系。

CREATE TABLE Grams(twog char(2), wordId int, PRIMARY KEY (twog, wordId))

Grams 表应该聚集在第一个twog 上,然后是wordId 以提高性能。当您查询一个单词(例如堆栈)时,您将克放入临时表中。首先让我们创建几百万条虚拟记录。

--make millions of 2grams
 DECLARE @i int =0
 WHILE (@i<5000000)
 BEGIN
-- a random 2gram
 declare @rnum1 char = CHAR(CAST(RAND()*28 AS INT)+97)
 declare @rnum2 char = CHAR(CAST(RAND()*28 AS INT)+97)
 INS... INTO Grams (twog, wordId) VALUES ( @rnum1 + @rnum2, CAST(RAND()*100000 AS int))
 END

现在让我们查询单词“stack”,它将被分解为:{'st','ta','ac','ck'} 两克。

DECLARE @word TABLE(twog char(2)) -- 'stack'
 INS... INTO @word VALUES ('st'), ('ta'), ('ac'), ('ck')

select wordId, count(*) from @word w inner join Grams g ON w.twog = g.twog
 GROUP BY wordId

您应该确保 Sql Server 使用一堆聚集索引查找(或查找)来运行此查询。它应该是自然的选择,但有时统计信息可能会损坏或过时,并且 SqlServer 可能会认为完整扫描更便宜。如果它不知道左侧表的基数,通常会发生这种情况,例如,SqlServer 可能会假设 @word 表很大,并且数百万次查找将比完整索引扫描更昂贵。

You have certainly seen the fuzzy text searches everywhere. For example you type "stck" but you actually mean "stack"! Ever wondered how does this stuff work?

There are plenty of algorithms to do fuzzy text matching, each with its own pro and cons. The most famous ones are edit distance and qgram. I want to focus on qgrams today and implement a sample.

Basically qgrams are the most suitable fuzzy string matching algorithm for relational databases. It is pretty simple. "q" in qgram will be replaced with a number like 2-gram or 3-gram or even 4-gram.

2-gram means that every word is broken into a set of two character grams. "Stack" will be broken into a set of {"st", "ta", "ac", "ck"} or "database" will be broken into {"da","at","ta","ab","ba","as","se"}.

Once the words are broken into 2-grams, we can search the database for a set of values instead of one string. For example if user mistyped "stck", any search for "stck" will not match "stack" because "a" is missing, but the 2-gram set {"st","tc","ck"} has 2 rows in common with the 2-gram set of stack! Bingo we found a pretty close match. It has nothing in common with the 2-gram set of database and only 1 in common with the 2-gram set of "stat" so we can easily suggest the user that he meant to type: first "stack" or second, "star".

Now let's implement it using Sql Server: Assume a hypothetical words dataset. You need to have a many to many relationship between 2grams and words.

CREATE TABLE Grams(twog char(2), wordId int, PRIMARY KEY (twog, wordId))

Grams table should be clustered on first twog, and then the wordId for performance. When you query a word (e.g. stack) you put the grams in a temp table. First lets create a few million dummy records.

--make millions of 2grams
 DECLARE @i int =0
 WHILE (@i<5000000)
 BEGIN
-- a random 2gram
 declare @rnum1 char = CHAR(CAST(RAND()*28 AS INT)+97)
 declare @rnum2 char = CHAR(CAST(RAND()*28 AS INT)+97)
 INS... INTO Grams (twog, wordId) VALUES ( @rnum1 + @rnum2, CAST(RAND()*100000 AS int))
 END

Now lets query the word "stack" which will be broken to: {'st','ta','ac','ck'} two grams.

DECLARE @word TABLE(twog char(2)) -- 'stack'
 INS... INTO @word VALUES ('st'), ('ta'), ('ac'), ('ck')

select wordId, count(*) from @word w inner join Grams g ON w.twog = g.twog
 GROUP BY wordId

You should make sure that Sql Server uses a bunch of clustered index seeks (or loockups) for running this query. It should be the natural choice but sometimes statistics may be corrupted or out of date and SqlServer may decide that a full scan is cheaper. This usually happens if it does not know the cardinality of left side table, for example SqlServer may assume that @word table is massive and millions of loockups is going to be more expensive than a full index scan.

已下线请稍等 2024-08-22 12:28:32

我最近一直在研究模糊字符串匹配,因此即使冒着回答废弃问题的风险,这里也是如此。希望您觉得这很有用。

我想您只对编辑距离小于给定值的字符串感兴趣。你的 q-gram(或 n-gram)看起来像这样

2-grams for "foobar": {"fo","oo","ob","ba","ar"}
  1. 你可以使用位置 q-gram:

    "foobar": {("fo",1),("oo",2),("ob",3),("ba",4),("ar",5) }
    

    位置信息可用于确定是否匹配
    q-gram 真是“绝配”。

    例如,如果您正在搜索
    具有最大编辑距离的“foobar”
    共 2 个,这意味着您仅
    对单词感兴趣

    2-gram“fo”存在于位置从 1 到 3 或
    2-gram“oo”存在于位置从 2 到 4 或
    ... 等等
    

    字符串“barfoo”没有得到任何
    匹配是因为
    否则匹配 2-gram 的不同之处在于
    3.

  2. 此外,使用它可能有用
    编辑距离之间的关系
    以及匹配 q-gram 的计数。
    直觉是,自从

    字符串 s 有 len(s)-q+1 个 q-gram

    一次编辑操作最多可以影响 q 个 q-gram,

    我们可以推断

    编辑距离 d 内的字符串 s1 和 s2 至少有
    max(len(s1),len(s2))-q+1-qk 匹配非位置 q-gram。

    如果您正在搜索“foobar”
    最大编辑距离为 2,匹配
    7 个字符的字符串(例如
    “fotocar”)应至少包含
    两个常见的 2-gram。

  3. 最后,显而易见的事情是
    按长度过滤。编辑
    两根弦之间的距离为
    最小长度差
    琴弦的。例如,如果您的
    阈值为 2 并且您搜索
    “foobar”、“foobarbar”不能
    显然匹配。

请参阅http://pages.stern.nyu.edu/~panos/ Publications/deb-dec2001.pdf 了解更多和一些伪 SQL。

I've been looking into fuzzy string matching lately, so even at the risk of answering to an abandoned question, here goes. Hope you find this useful.

I suppose you're only interested in the strings for which the edit distance is smaller than a given value. And your q-grams (or n-grams) look like this

2-grams for "foobar": {"fo","oo","ob","ba","ar"}
  1. You could use positional q-grams:

    "foobar": {("fo",1),("oo",2),("ob",3),("ba",4),("ar",5)}
    

    The positional information can be used to determine if a matching
    q-gram is really a "good match".

    For example, if you're searching for
    "foobar" with maximum edit distance
    of 2, this means that you're only
    interested in words where

    2-gram "fo" exists in with position from 1 to 3 or
    2-gram "oo" exists in with position from 2 to 4 or
    ... and so on
    

    String "barfoo" doesn't get any
    matches because the positions of the
    otherwise matching 2-grams differ by
    3.

  2. Also, it might be useful to use
    the relation between edit distance
    and the count of matching q-grams.
    The intution is that since

    a string s has len(s)-q+1 q-grams

    and

    a single edit operation can affect at most q q-grams,

    we can deduce that

    strings s1 and s2 within edit distance of d have at least
    max(len(s1),len(s2))-q+1-qk matching non-positional q-grams.

    If you're searching for "foobar"
    with an maximum edit distance of 2, a matching
    7-character string (such as
    "fotocar") should contain at least
    two common 2-grams.

  3. Finally, the obvious thing to do is
    to filter by lenght. The edit
    distance between two strings is at
    least the difference of the lengths
    of the strings. For example if your
    threshold is 2 and you search for
    "foobar", "foobarbar" cannot
    obviously match.

See http://pages.stern.nyu.edu/~panos/publications/deb-dec2001.pdf for more and some pseudo SQL.

︶葆Ⅱㄣ 2024-08-22 12:28:32

关于索引 DNA q-gram 的有趣论文,这样您就不必扫描整个表格:

www.comp.nus.edu.sg/~atung/publication/qgram_edit.pdf

Interesting paper about indexing DNA q-grams so you don't have to scan the entire table:

www.comp.nus.edu.sg/~atung/publication/qgram_edit.pdf

叹倦 2024-08-22 12:28:32

我有一个简单的改进,不会消除扫描,但如果您仅使用 2 克或 3 克,则会加快扫描速度:用数字替换字母。大多数 SQL 引擎在比较数字时工作速度要快得多。

示例:我们的源表在一列中包含文本条目。
我们创建一个临时表,在其中使用 a 将名称拆分为 2 克。

SELECT SUBSTRING (column, 1,2) as gram, 1 as position FROM sourcetable
UNION  
SELECT SUBSTRING (column, 2,2) as gram, 2 as position FROM sourcetable
UNION
SELECT SUBSTRING (column, 3,2) as gram, 3 as position FROM sourcetable

etc. 

这应该在循环中运行,其中 i=0 且 j=源条目的最大大小。

然后我们准备一个映射表,其中包含所有可能的 2 字母克,并包含一个名为 gram_id 的 IDENTITY (1,1) 列。我们可以按英语词典中的频率对克进行排序,并消除最不常见的克(例如“kk”或“wq”) - 这种排序可能需要一些时间和研究,但它会将最小的数字分配给最频繁的克,这如果我们可以将克数限制为 255,则可以提高性能,因为这样我们就可以对 gram_id 使用tinyint 列。

然后我们从第一个临时表重建另一个临时表,其中我们使用 gram_id 而不是 gram。这将成为主表。我们在 gram_id 列和位置列上创建索引。

然后,当我们必须将文本字符串与主表进行比较时,我们首先将文本字符串拆分为 2-gram,然后将 2-gram 替换为其 gram_id(使用映射表),并将它们与以下之一进行比较master 表

做了很多比较,但大部分都是2位整数,速度很快。

I have a simple improvement which will not eliminate the scan, but speed it up if you use 2-grams or 3-grams only: replace the letters by numbers. Most SQL engines work a lot faster when comparing numbers.

Example: our source table contains text entries in one column.
We create a temp table where we split the names in 2-grams using a

SELECT SUBSTRING (column, 1,2) as gram, 1 as position FROM sourcetable
UNION  
SELECT SUBSTRING (column, 2,2) as gram, 2 as position FROM sourcetable
UNION
SELECT SUBSTRING (column, 3,2) as gram, 3 as position FROM sourcetable

etc. 

This should run in a loop where i=0 and j=the max size of a source entry.

Then we prepare a mapping table which contains all possible 2-letter grams and include a IDENTITY (1,1) column called gram_id. We may sort the grams by frequency in the English dictionary and eliminate the most infrequent grams (like 'kk' or 'wq') - this sorting may take some time and research but it will assign the smallest numbers to the most frequent grams, which will then improve performance if we can limit the number of grams to 255 because then we can use a tinyint column for the gram_id.

Then we rebuild another temp table from the first one, where we use the gram_id instead of the gram. That becomes the master table. We create an index on the gram_id column and on the position column.

Then when we have to compare a text string to the master table, we first split the text string split it into 2-grams, then replace the 2-grams by their gram_id (using the mapping table), and compare them to the one of the master table

That makes a lot of comparisons, but most of them are 2-digit integers, which is very quick.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文