T-SQL 中的编辑距离
我对 T-SQL 计算 Levenshtein 距离的算法感兴趣。
I am interested in algorithm in T-SQL calculating Levenshtein distance.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
我对 T-SQL 计算 Levenshtein 距离的算法感兴趣。
I am interested in algorithm in T-SQL calculating Levenshtein distance.
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
接受
或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
发布评论
评论(7)
我在TSQL中实现了Levenshtein编辑距离函数 进行了多项优化,提高了我所知道的其他版本的速度。 如果两个字符串的开头有共同的字符(共享前缀),结尾有共同的字符(共享后缀),并且当字符串很大并且提供了最大编辑距离时,速度的提高是显着的。 例如,当输入是两个非常相似的 4000 个字符串,并且指定最大编辑距离为 2 时,这几乎比
edit_distance_within
函数在接受的答案中,在 0.073 秒(73 毫秒)内返回答案,而不是 55 秒。 它的内存效率也很高,使用的空间等于两个输入字符串中较大的一个加上一些常量空间。 它使用表示列的单个 nvarchar“数组”,并在其中就地进行所有计算,以及一些辅助 int 变量。优化:
这里是代码(2014 年 1 月 20 日更新以加快速度):
正如此函数的注释中提到的,字符比较的区分大小写将遵循有效的排序规则。 默认情况下,SQL Server 的排序规则将导致不区分大小写的比较。
修改此函数以始终区分大小写的一种方法是向比较字符串的两个位置添加特定的排序规则。 但是,我还没有对此进行彻底测试,特别是对于数据库使用非默认排序规则时的副作用。
以下是如何更改这两行以强制区分大小写的比较:
和
I implemented the Levenshtein edit distance function in TSQL with several optimizations that improves the speed over the other versions I'm aware of. In cases where the two strings have characters in common at their start (shared prefix), characters in common at their end (shared suffix), and when the strings are large and a max edit distance is provided, the improvement in speed is significant. For example, when the inputs are two very similar 4000 character strings, and a max edit distance of 2 is specified, this is almost three orders of magnitude faster than the
edit_distance_within
function in the accepted answer, returning the answer in 0.073 seconds (73 milliseconds) vs 55 seconds. It's also memory efficient, using space equal to the larger of the two input strings plus some constant space. It uses a single nvarchar "array" representing a column, and does all computations in-place in that, plus some helper int variables.Optimizations:
Here is the code (updated 1/20/2014 to speed it up a bit more):
As mentioned in the comments of this function, the case sensitivity of the character comparisons will follow the collation that's in effect. By default, SQL Server's collation is one that will result in case insensitive comparisons.
One way to modify this function to always be case sensitive would be to add a specific collation to the two places where strings are compared. However, I have not thoroughly tested this, especially for side effects when the database is using a non-default collation.
These are how the two lines would be changed to force case sensitive comparisons:
and
Arnold Fribble 在 sqlteam.com/forums 上提出了两项提案,
这是 2006 年较年轻的一个:
Arnold Fribble had two proposals on sqlteam.com/forums
This is the younger one from 2006:
IIRC,使用 SQL Server 2005 及更高版本,您可以使用任何 .NET 语言编写存储过程: 在 SQL Server 2005 中使用 CLR 集成。 这样,编写一个计算 Levenstein 距离 的过程就不难了。
一个简单的你好,世界! 从帮助中提取:
然后在您的 SQL Server 中运行以下命令:
现在您可以测试运行它:
希望这会有所帮助。
IIRC, with SQL Server 2005 and later you can write stored procedures in any .NET language: Using CLR Integration in SQL Server 2005. With that it shouldn't be hard to write a procedure for calculating Levenstein distance.
A simple Hello, World! extracted from the help:
Then in your SQL Server run the following:
And now you can test run it:
Hope this helps.
您可以使用 Levenshtein 距离算法来比较字符串
您可以在 http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx
(由 Joseph Gama 开发的函数)
用法:
该算法只是返回 stpe 计数,通过一步替换不同的字符来将一个字符串更改为另一个字符串
You can use Levenshtein Distance Algorithm for comparing strings
Here you can find a T-SQL example at http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx
(Function developped by Joseph Gama)
Usage :
The algorithm simply returns the stpe count to change one string into other by replacing a different character at one step
我也在寻找 Levenshtein 算法的代码示例,很高兴在这里找到它。 当然,我想了解该算法是如何工作的,并且我正在玩一点上面的示例之一,我正在玩一点由 Veve。 为了更好地理解代码,我用矩阵创建了一个 EXCEL。
FUZZY 与 FUZY 的距离
图片内容超过 1000 个单词。
通过这个 EXCEL,我发现有可能进行额外的性能优化。 右上红色区域中的所有值都不需要计算。 每个红色单元格的值都会导致左侧单元格的值加 1。这是因为,该区域中的第二个字符串始终比第一个字符串长,这会导致每个字符的距离增加 1。
您可以通过使用语句 IF @j <= @i 并在该语句之前增加 @i 的值来反映这一点。
I was looking for a code example for the Levenshtein algorithm, too, and was happy to find it here. Of course I wanted to understand how the algorithm is working and I was playing around a little bit with one of the above examples I was playing around a little bit that was posted by Veve. In order to get a better understanding of the code I created an EXCEL with the Matrix.
distance for FUZZY compared with FUZY
Images say more than 1000 words.
With this EXCEL I found that there was potential for additional performance optimization. All values in the upper right red area do not need to be calculated. The value of each red cell results in the value of the left cell plus 1. This is because, the second string will be always longer in that area than the first one, what increases the distance by the value of 1 for each character.
You can reflect that by using the statement IF @j <= @i and increasing the value of @i Prior to this statement.
在 TSQL 中,比较两个项目的最佳且最快的方法是在索引列上连接表的 SELECT 语句。 因此,如果您想受益于 RDBMS 引擎的优势,这就是我建议实现编辑距离的方式。 TSQL 循环也可以工作,但对于大容量比较,其他语言中的 Levenstein 距离计算将比 TSQL 更快。
我已经在几个系统中使用一系列针对专门为此目的而设计的临时表的联接来实现编辑距离。 它需要一些繁重的预处理步骤 - 临时表的准备 - 但它在大量比较时效果很好。
简而言之:预处理包括创建、填充和索引临时表。 第一个包含引用 ID、一个单字母列和一个 charindex 列。 该表是通过运行一系列插入查询来填充的,这些查询将每个单词拆分为字母(使用 SELECT SUBSTRING),以创建与源列表中的单词具有字母一样多的行(我知道,这是很多行,但 SQL Server 可以处理数十亿行)行)。 然后制作第二个包含 2 个字母列的表,另一个包含 3 个字母列的表,等等。最终结果是一系列表,其中包含每个单词的引用 ID 和子字符串,以及它们位置的引用在这个词中。
完成此操作后,整个游戏就是复制这些表并将它们与 GROUP BY 选择查询中的副本连接起来,该查询会计算匹配的数量。 这为每个可能的单词对创建了一系列度量,然后将其重新聚合为每对单词的单个 Levenstein 距离。
从技术上讲,这与 Levenstein 距离(或其变体)的大多数其他实现非常不同,因此您需要深入了解 Levenstein 距离的工作原理以及其设计原因。 还要研究替代方案,因为通过这种方法,您最终会得到一系列基础指标,这些指标可以帮助同时计算编辑距离的许多变体,从而为您提供有趣的机器学习潜在改进。
本页之前的答案已经提到的另一点:尝试尽可能多地进行预处理,以消除不需要距离测量的对。 例如,应该排除没有一个共同字母的一对两个单词,因为可以从字符串的长度获得编辑距离。 或者不要测量同一个单词的两个副本之间的距离,因为它本质上是 0。 或者在进行测量之前删除重复项,如果您的单词列表来自长文本,则相同的单词可能会出现多次,因此仅测量一次距离将节省处理时间等。
In TSQL the best and fastest way to compare two items are SELECT statements that join tables on indexed columns. Therefore this is how I suggest to implement the editing distance if you want to benefit from the advantages of a RDBMS engine. TSQL Loops will work too, but Levenstein distance calculations will be faster in other languages than in TSQL for large volume comparisons.
I have implemented the editing distance in several systems using series of Joins against temporary tables designed for that purpose only. It requires some heavy pre-processing steps - the preparation of the temporary tables - but it works very well with large number of comparisons.
In a few words: the pre-processing consists of creating, populating and indexing temp tables. The first one contains reference ids, a one-letter column and a charindex column. This table is populated by running a series of insert queries that split every word into letters (using SELECT SUBSTRING) to create as many rows as word in the source list have letters (I know, that's a lot of rows but SQL server can handle billions of rows). Then make a second table with a 2-letter column, another table with a 3-letter column, etc. The end results is a series of tables which contain reference ids and substrings of each the words, as well a the reference of their position in the word.
Once this is done, the whole game is about duplicating these tables and joining them against their duplicate in a GROUP BY select query which counts the number of matches. This creates a series of measures for every possible pair of words, which are then re-aggregated into a single Levenstein distance per pair of words.
Technically this is very different than most other implementations of the Levenstein distance (or its variants) so you need to deeply understand how the Levenstein distance works and why it was designed as it is. Investigate the alternatives as well because with that method you end up with a series of underlying metrics which can help calculate many variants of the editing distance at the same time, providing you with interesting machine learning potential improvements.
Another point already mentioned by previous answers in this page: try to pre process as much as possible to eliminate the pairs that do not require distance measurement. For example a pair of two words that have not a single letter in common should be excluded, because the editing distance can be obtained from the length of the strings. Or do not measure the distance between two copies of the same word, since it is 0 by nature. Or remove duplicates before doing the measurement, if your list of words comes from a long text it is likely that the same words will appear more than once, so measuring the distance only once will save processing time, etc.
我的 Azure Synapse 模组(更改为使用 SET 而不是 SELECT):
My mods for Azure Synapse (changed to use SET instead of SELECT):