尝试在 T-SQL 查询中使用编辑距离 - 请帮助优化
我正在尝试使用我在网上找到的编辑算法来计算与搜索项最接近的值。以实现模糊术语匹配。我当前的查询运行时间约为 45 秒。我希望我能优化它。我已经为计算其编辑值的字段添加了索引。我发现的 levenshtein 函数可能不是最优化的,我不相信它的实现。这是该函数:
CREATE FUNCTION [dbo].[LEVENSHTEIN]( @s NVARCHAR(MAX), @t NVARCHAR(MAX) )
/*
Levenshtein Distance Algorithm: TSQL Implementation
by Joseph Gama
http://www.merriampark.com/ldtsql.htm
Returns the Levenshtein Distance between strings s1 and s2.
Original developer: Michael Gilleland http://www.merriampark.com/ld.htm
Translated to TSQL by Joseph Gama
Fixed by Herbert Oppolzer / devio
as described in http://devio.wordpress.com/2010/09/07/calculating-levenshtein-distance-in-tsql
*/
RETURNS INT AS
BEGIN
DECLARE @d NVARCHAR(MAX), @LD INT, @m INT, @n INT, @i INT, @j INT,
@s_i NCHAR(1), @t_j NCHAR(1),@cost INT
--Step 1
SET @n = LEN(@s)
SET @m = LEN(@t)
SET @d = REPLICATE(NCHAR(0),(@n+1)*(@m+1))
IF @n = 0
BEGIN
SET @LD = @m
GOTO done
END
IF @m = 0
BEGIN
SET @LD = @n
GOTO done
END
--Step 2
SET @i = 0
WHILE @i <= @n BEGIN
SET @d = STUFF(@d,@i+1,1,NCHAR(@i)) --d(i, 0) = i
SET @i = @i+1
END
SET @i = 0
WHILE @i <= @m BEGIN
SET @d = STUFF(@d,@i*(@n+1)+1,1,NCHAR(@i)) --d(0, j) = j
SET @i = @i+1
END
--Step 3
SET @i = 1
WHILE @i <= @n BEGIN
SET @s_i = SUBSTRING(@s,@i,1)
--Step 4
SET @j = 1
WHILE @j <= @m BEGIN
SET @t_j = SUBSTRING(@t,@j,1)
--Step 5
IF @s_i = @t_j
SET @cost = 0
ELSE
SET @cost = 1
--Step 6
SET @d = STUFF(@d,@j*(@n+1)+@i+1,1,
NCHAR(dbo.MIN3(
UNICODE(SUBSTRING(@d,@j*(@n+1)+@i-1+1,1))+1,
UNICODE(SUBSTRING(@d,(@j-1)*(@n+1)+@i+1,1))+1,
UNICODE(SUBSTRING(@d,(@j-1)*(@n+1)+@i-1+1,1))+@cost)
))
SET @j = @j+1
END
SET @i = @i+1
END
--Step 7
SET @LD = UNICODE(SUBSTRING(@d,@n*(@m+1)+@m+1,1))
done:
RETURN @LD
END
这是我正在使用的查询:
SELECT [Address], [dbo].[LEVENSHTEIN](@searchTerm, [Address]) As LevenshteinDistance
FROM Streets
Order By LevenshteinDistance
我不是 DBA,所以请原谅我对任何最佳实践的无知 - 这就是我来这里学习的原因:)。我真的不想在业务层中卸载此处理,并希望将其保留在数据层中,但只有 16k 记录需要 45 秒来处理,目前无法使用。这只是记录的一小部分,一旦我处理完输入文件,这些记录将构成整个数据存储。提前致谢。
I'm am trying to use a levenshtein algorithm I found on the 'net to calculate the closest value to a search term. In order to implement fuzzy term matching. My current query runs about 45 seconds long. I'm hoping I can optimize it. I've already added indexes for the fields that I'm calculated the levenshtein value for. The levenshtein function I found may not be the most optimized and I take no credit in it's implementation. Here is that function:
CREATE FUNCTION [dbo].[LEVENSHTEIN]( @s NVARCHAR(MAX), @t NVARCHAR(MAX) )
/*
Levenshtein Distance Algorithm: TSQL Implementation
by Joseph Gama
http://www.merriampark.com/ldtsql.htm
Returns the Levenshtein Distance between strings s1 and s2.
Original developer: Michael Gilleland http://www.merriampark.com/ld.htm
Translated to TSQL by Joseph Gama
Fixed by Herbert Oppolzer / devio
as described in http://devio.wordpress.com/2010/09/07/calculating-levenshtein-distance-in-tsql
*/
RETURNS INT AS
BEGIN
DECLARE @d NVARCHAR(MAX), @LD INT, @m INT, @n INT, @i INT, @j INT,
@s_i NCHAR(1), @t_j NCHAR(1),@cost INT
--Step 1
SET @n = LEN(@s)
SET @m = LEN(@t)
SET @d = REPLICATE(NCHAR(0),(@n+1)*(@m+1))
IF @n = 0
BEGIN
SET @LD = @m
GOTO done
END
IF @m = 0
BEGIN
SET @LD = @n
GOTO done
END
--Step 2
SET @i = 0
WHILE @i <= @n BEGIN
SET @d = STUFF(@d,@i+1,1,NCHAR(@i)) --d(i, 0) = i
SET @i = @i+1
END
SET @i = 0
WHILE @i <= @m BEGIN
SET @d = STUFF(@d,@i*(@n+1)+1,1,NCHAR(@i)) --d(0, j) = j
SET @i = @i+1
END
--Step 3
SET @i = 1
WHILE @i <= @n BEGIN
SET @s_i = SUBSTRING(@s,@i,1)
--Step 4
SET @j = 1
WHILE @j <= @m BEGIN
SET @t_j = SUBSTRING(@t,@j,1)
--Step 5
IF @s_i = @t_j
SET @cost = 0
ELSE
SET @cost = 1
--Step 6
SET @d = STUFF(@d,@j*(@n+1)+@i+1,1,
NCHAR(dbo.MIN3(
UNICODE(SUBSTRING(@d,@j*(@n+1)+@i-1+1,1))+1,
UNICODE(SUBSTRING(@d,(@j-1)*(@n+1)+@i+1,1))+1,
UNICODE(SUBSTRING(@d,(@j-1)*(@n+1)+@i-1+1,1))+@cost)
))
SET @j = @j+1
END
SET @i = @i+1
END
--Step 7
SET @LD = UNICODE(SUBSTRING(@d,@n*(@m+1)+@m+1,1))
done:
RETURN @LD
END
And here is the query I'm using:
SELECT [Address], [dbo].[LEVENSHTEIN](@searchTerm, [Address]) As LevenshteinDistance
FROM Streets
Order By LevenshteinDistance
I'm not a DBA so please forgive my ignorance in any best practices - that's why I'm here to learn :). I really don't want to offload this processing in the business layer and am hoping to keep it in the data layer but with only 16k records taking 45 seconds to process it's currently not usable. This is only with a small subset of the records which will comprise the entire data store once I'm done processing the input files. Thanks in advance.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
如果您希望它运行得非常快,请考虑用 C# 创建一个 dll。它将使您的速度提高 150% ;)
这是我的博客,向您解释如何做到这一点。
http://levenshtein.blogspot.com/2011/04/how -it-is-done.html
If you want it to run really fast, consider creating a dll in C#. It will improve your speed by 150% ;)
Here is my blog that explains you how to do it.
http://levenshtein.blogspot.com/2011/04/how-it-is-done.html
您应该阅读此主题和这些链接:http://www.vbforums.com/showthread。 php?t=575471
You should read this thread and those links: http://www.vbforums.com/showthread.php?t=575471