优化编辑距离算法
我有一个存储过程,它使用编辑距离来确定最接近用户键入内容的结果。唯一真正影响速度的是在选择距离最小的记录之前计算所有记录的 Levenshtein 距离的函数(我通过将 0 代替对 Levenshtein 函数的调用来验证这一点)。该表有 150 万条记录,因此即使是最轻微的调整也可能会缩短几秒钟。现在整个过程持续了10多分钟。这是我正在使用的方法:
ALTER function dbo.Levenshtein
(
@Source nvarchar(200),
@Target nvarchar(200)
)
RETURNS int
AS
BEGIN
DECLARE @Source_len int, @Target_len int, @i int, @j int, @Source_char nchar, @Dist int, @Dist_temp int, @Distv0 varbinary(8000), @Distv1 varbinary(8000)
SELECT @Source_len = LEN(@Source), @Target_len = LEN(@Target), @Distv1 = 0x0000, @j = 1, @i = 1, @Dist = 0
WHILE @j <= @Target_len
BEGIN
SELECT @Distv1 = @Distv1 + CAST(@j AS binary(2)), @j = @j + 1
END
WHILE @i <= @Source_len
BEGIN
SELECT @Source_char = SUBSTRING(@Source, @i, 1), @Dist = @i, @Distv0 = CAST(@i AS binary(2)), @j = 1
WHILE @j <= @Target_len
BEGIN
SET @Dist = @Dist + 1
SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j-1, 2) AS int) +
CASE WHEN @Source_char = SUBSTRING(@Target, @j, 1) THEN 0 ELSE 1 END
IF @Dist > @Dist_temp
BEGIN
SET @Dist = @Dist_temp
END
SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j+1, 2) AS int)+1
IF @Dist > @Dist_temp SET @Dist = @Dist_temp
BEGIN
SELECT @Distv0 = @Distv0 + CAST(@Dist AS binary(2)), @j = @j + 1
END
END
SELECT @Distv1 = @Distv0, @i = @i + 1
END
RETURN @Dist
END
我应该从这里去哪里?
I have a stored procedure that uses Levenshtein distance to determine the result closest to what the user typed. The only thing really affecting the speed is the function that calculates the Levenshtein distance for all the records before selecting the record with the lowest distance (I've verified this by putting a 0 in place of the call to the Levenshtein function). The table has 1.5 million records, so even the slightest adjustment may shave off a few seconds. Right now the entire thing runs over 10 minutes. Here's the method I'm using:
ALTER function dbo.Levenshtein
(
@Source nvarchar(200),
@Target nvarchar(200)
)
RETURNS int
AS
BEGIN
DECLARE @Source_len int, @Target_len int, @i int, @j int, @Source_char nchar, @Dist int, @Dist_temp int, @Distv0 varbinary(8000), @Distv1 varbinary(8000)
SELECT @Source_len = LEN(@Source), @Target_len = LEN(@Target), @Distv1 = 0x0000, @j = 1, @i = 1, @Dist = 0
WHILE @j <= @Target_len
BEGIN
SELECT @Distv1 = @Distv1 + CAST(@j AS binary(2)), @j = @j + 1
END
WHILE @i <= @Source_len
BEGIN
SELECT @Source_char = SUBSTRING(@Source, @i, 1), @Dist = @i, @Distv0 = CAST(@i AS binary(2)), @j = 1
WHILE @j <= @Target_len
BEGIN
SET @Dist = @Dist + 1
SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j-1, 2) AS int) +
CASE WHEN @Source_char = SUBSTRING(@Target, @j, 1) THEN 0 ELSE 1 END
IF @Dist > @Dist_temp
BEGIN
SET @Dist = @Dist_temp
END
SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j+1, 2) AS int)+1
IF @Dist > @Dist_temp SET @Dist = @Dist_temp
BEGIN
SELECT @Distv0 = @Distv0 + CAST(@Dist AS binary(2)), @j = @j + 1
END
END
SELECT @Distv1 = @Distv0, @i = @i + 1
END
RETURN @Dist
END
Where should I go from here?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我过去这样做的方法是将“数据库”(实际上是用于拼写纠正器的单词词典)存储为特里树。
然后我使用分支定界例程来查找最近的匹配条目。对于小距离,所花费的时间与距离呈指数关系。对于长距离,它与字典的大小成线性关系,就像您现在看到的那样。
分支定界基本上是 trie 的深度优先树遍历,但有错误预算。在每个节点,您跟踪当前的编辑距离,如果超出预算,则修剪树的该分支。
首先,您以零预算进行步行。这只会找到完全匹配的结果。如果你没有找到匹配的,那么你就以 1 的预算走过去。这将在距离 1 处找到匹配项。如果找不到任何匹配项,则以预算 2 进行匹配,依此类推。这听起来效率很低,但由于每次步行比前一次花费的时间要多得多,因此时间主要由您最后一次步行决定。
添加:代码概要(请原谅我的 C):
基本上,您通过跳过字母并在同一节点搜索来模拟删除它。您可以通过降序排列 trie 而不前进 s 来模拟插入字母。您可以通过表现得好像字母匹配来模拟替换字母,即使事实并非如此。当你掌握了它的窍门后,你可以添加其他可能的不匹配,例如用 O 替换 0,用 L 或 I 替换 1 - 像这样的愚蠢的东西。
您可能想要添加一个字符数组参数来表示您在 trie 中找到的当前单词。
The way I've done this in the past is to store the "database" (actually a dictionary of words for a spelling correcter) as a trie.
Then I used a branch-and-bound routine to look up nearest matching entries. For small distances, the time it takes is exponential in the distance. For large distances, it is linear in the size of the dictionary, just as you are seeing now.
Branch-and-bound is basically a depth-first tree walk of the trie, but with an error budget. At each node, you keep track of the current levenshtein distance, and if it exceeds the budget, you prune that branch of the tree.
First you do the walk with a budget of zero. That will only find exact matches. If you don't find a match, then you walk it with a budget of one. That will find matches at a distance of 1. If you don't find any, then you do it with a budget of 2, and so on. This sounds inefficient, but since each walk takes so much more time than the previous one, the time is dominated by the last walk that you make.
Added: outline of code (pardon my C):
Basically, you simulate deleting a letter by skipping it and searching at the same node. You simulate inserting a letter by descending the trie without advancing s. You simulate replacing a letter by acting as if the letter matched, even though it doesn't. When you get the hang of it, you can add other possible mismatches, like replacing 0 with O and 1 with L or I - dumb stuff like that.
You probably want to add a character array argument to represent the current word you are finding in the trie.