mysql/模糊搜索的Levenshtein距离的实现?
我希望能够按如下方式搜索 smith 的表格,以获得 1 方差以内的所有内容。
数据:
O'Brien Smithe Dolan Smuth Wong Smoth Gunther Smiht
我研究过使用编辑距离,有人知道如何用它来实现吗?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
我正在设置一个基于 Levenshtein 或 Damerau-Levenshtein (可能是后者)的搜索,以对索引文本进行多次搜索,基于 Gonzalo Navarro 和 Ricardo Baeza-yates 的论文: 链接文本
构建后缀数组后(参见 wikipedia),如果您对与搜索字符串最多有 k 个不匹配的字符串感兴趣,请将搜索字符串分成 k+1 段; 至少其中之一必须完好无损。 通过对后缀数组进行二分查找来查找子字符串,然后将距离函数应用于每个匹配片段周围的补丁。
I am setting up a search based on Levenshtein or Damerau-Levenshtein (probably the latter) for multiple searches over an indexed text, based on a paper by by Gonzalo Navarro and Ricardo Baeza-yates: link text
After building a suffix array (see wikipedia), if you are interested in a string with at most k mismatches to the search string, break the search string into k+1 pieces; at least one of those must be intact. Find the substrings by binary search over the suffix array, then apply the distance function to the patch around each matched piece.
您可以使用此函数
并将其获取为 XX% 使用此函数
you can use this function
and for getting it as XX% use this function
我有一个特殊的 k 距离搜索案例,在 MySQL 中安装 Damerau-Levenshtein UDF 后发现查询花费的时间太长。 我想出了以下解决方案:
创建一个新表(或将列附加到目标表),其中包含目标字段中每个字符位置的列。 IE。 我的 VARCHAR(9) 最终成为 9 个 TINYINT 列 + 1 个 Id 列,与我的主表匹配(为每列添加索引)。 我添加了触发器以确保当我的主表更新时这些新列始终得到更新。
要执行 k 距离查询,请使用以下谓词:
(Column1=s[0]) + (Column2=s[1]) + (Column3=s[2]) + (Column4=s[3]) + .. . >= m
其中 s 是您的搜索字符串,m 是所需的匹配字符数(或者在我的情况下 m = 9 - d,其中 d 是我想要返回的最大距离)。
经过测试,我发现平均需要 4.6 秒的超过 100 万行的查询在不到一秒的时间内返回匹配的 id。 返回主表中匹配行的数据的第二个查询同样花费了不到一秒钟的时间。 (将这两个查询组合为子查询或联接会导致执行时间显着延长,我不确定为什么。)
虽然这不是 Damerau-Levenshtein (不考虑替换),但它足以满足我的目的。
尽管此解决方案对于较大(长度)的搜索空间可能无法很好地扩展,但它对于这种限制性情况非常有效。
I had a specialized case of k-distance searching and after installing the Damerau-Levenshtein UDF in MySQL found that the query was taking too long. I came up with the following solution:
Create a new table (or append columns to your target table) with columns for each character position in your target field. ie. My VARCHAR(9) ended up as 9 TINYINT columns + 1 Id column that matches my main table (add indexes for each column). I added triggers to ensure that these new columns always get updated when my main table gets updated.
To perform a k-distance query use the following predicate:
(Column1=s[0]) + (Column2=s[1]) + (Column3=s[2]) + (Column4=s[3]) + ... >= m
where s is your search string and m is the required number of matching characters (or m = 9 - d in my case where d is the maximum distance I want returned).
After testing I found that a query over 1 million rows that was taking 4.6 seconds on average was returning matching ids in less than a second. A second query to return the data for the matching rows in my main table similarly took under a second. (Combining these two queries as a subquery or join resulted in significantly longer execution times and I'm not sure why.)
Though this is not Damerau-Levenshtein (doesn't account for substitution) it suffices for my purposes.
Though this solution probably doesn't scale well for a larger (length) search space it worked for this restrictive case very well.
为了使用编辑距离进行有效搜索,您需要一个高效、专门的索引,例如 bk-tree。 不幸的是,我所知道的数据库系统(包括 MySQL)都没有实现 bk-tree 索引。 如果您正在寻找全文搜索,而不是每行单个术语,则情况会更加复杂。 临时,我想不出任何可以允许基于编辑距离进行搜索的方式进行全文索引的方法。
In order to efficiently search using levenshtein distance, you need an efficient, specialised index, such as a bk-tree. Unfortunately, no database system I know of, including MySQL, implements bk-tree indexes. This is further complicated if you're looking for full-text search, instead of just a single term per row. Off-hand, I can't think of any way that you could do full-text indexing in a manner that allows for searching based on levenshtein distance.
Levenshtein Distance函数有一个mysql UDF实现
https://github.com/jmcejuela/Levenshtein-MySQL -UDF
用C实现,比schnaader提到的“MySQL Levenshtein距离查询”有更好的性能
There is a mysql UDF implementation of Levenshtein Distance function
https://github.com/jmcejuela/Levenshtein-MySQL-UDF
It is implemented in C and has better performance than the "MySQL Levenshtein distance query" mentioned by schnaader
上面为 levenshtein <= 1 给出的函数是不正确的——它给出了错误的结果,例如“bed”和“bid”。
我修改了上面第一个答案中给出的“MySQL Levenshtein距离查询”,以接受一个“限制”,这将加快速度。 基本上,如果您只关心 Levenshtein <= 1,则将限制设置为“2”,如果它是 0 或 1,该函数将返回精确的 levenshtein 距离; 如果精确的编辑距离为 2 或更大,则为 2。
此 mod 使其速度提高 15% 到 50% - 搜索词越长,优势就越大(因为算法可以更早地退出。)例如,在搜索 200,000 个单词时,查找单词距离 1 内的所有匹配项“咯咯笑”,原版在我的笔记本电脑上需要 3 分 47 秒,而“限制”版本则需要 1 分 39 秒。 当然,这些对于任何实时使用来说都太慢了。
代码:
The function given for levenshtein <= 1 above is not right -- it gives incorrect results for e.g., "bed" and "bid".
I modified the "MySQL Levenshtein distance query" given above, in the first answer, to accept a "limit" that will speed it up a little. Basically, if you only care about Levenshtein <= 1, set the limit to "2" and the function will return the exact levenshtein distance if it is 0 or 1; or a 2 if the exact levenshtein distance is 2 or greater.
This mod makes it 15% to 50% faster -- the longer your search word, the bigger the advantage (because the algorithm can bail earlier.) For instance, on a search against 200,000 words to find all matches within distance 1 of the word "giggle," the original takes 3 min 47 sec on my laptop, versus 1:39 for the "limit" version. Of course, these are both too slow for any real-time use.
Code:
可以在此处找到 damerau-levenshtein 距离的实现:
Damerau-Levenshtein 算法:具有换位的 Levenshtein
相对于纯编辑距离的改进在于考虑了字符的交换。 我在schnaader的链接的评论中找到了它,谢谢!
An implementation for the damerau-levenshtein distance can be found here:
Damerau-Levenshtein algorithm: Levenshtein with transpositions
The improvement over pure Levenshtein distance is that the swapping of characters is considered. I found it in the comments of schnaader's link, thanks!
如果您只想知道levenshtein-distance是否最多为1,您可以使用以下MySQL函数。
这基本上是编辑距离的递归描述中的一个步骤。
如果距离最多为 1,则该函数返回 1,否则返回 0。
由于该函数不完全计算编辑距离,因此速度要快得多。
您还可以修改此函数,使其在 levenshtein-distance 至多为 2 或 3 时返回
true
,方法是通过递归调用它。 如果MySQL不支持递归调用,您可以复制此函数的稍微修改版本两次并调用它们。 但您不应该使用递归函数来计算精确的编辑距离。If you only want to know if the levenshtein-distance is at most 1, you can use the following MySQL function.
This is basically a single step in the recursive description of the levenshtein distance.
The function returns 1, if the distance is at most 1, else it returns 0.
Since this function does not completely compute the levenshtein-distance, it is much faster.
You can also modify this function such that it returns
true
if the levenshtein-distance is at most 2 or 3, by calling it self recursively. If MySQL does not support recursive calls, you can copy slightly modified versions of this function two times and call them instead. But you should not use the recursive function to calculate the exact levenshtein-distance.基于 Chella 的回答 和 Ryan Ginstrom 的 文章,可以实现模糊搜索如此:
Based on Chella's answer and Ryan Ginstrom's article, a fuzzy search could be implemented as so: