查找所有子字符串的编辑距离的算法
给定 2 个字符串 s
和 t
。我需要找到 s
到 t
的编辑距离(Levenshtein 距离)中的每个子字符串。实际上,我需要知道 s
中的每个 i
位置从位置 i
开始的所有子字符串的最小编辑距离是多少。
例如:
t = "ab"
s = "sdabcb"
我需要得到类似:
{2,1,0,2,2}
解释:
1st position:
distance("ab", "sd") = 4 ( 2*subst )
distance("ab", "sda") = 3( 2*delete + insert )
distance("ab", "sdab") = 2 ( 2 * delete)
distance("ab", "sdabc") = 3 ( 3 * delete)
distance("ab", "sdabcb") = 4 ( 4 * delete)
So, minimum is 2
2nd position:
distance("ab", "da") = 2 (delete + insert)
distance("ab", "dab") = 1 (delete)
distance("ab", "dabc") = 2 (2*delete)
....
So, minimum is 1
3th position:
distance("ab", "ab") = 0
...
minimum is 0
等等。
当然,我可以使用暴力算法来解决这个任务。但有没有更快的算法呢?
感谢您的帮助。
Given 2 strings s
and t
. I need to find for each substring in s
edit distance(Levenshtein distance) to t
. Actually I need to know for each i
position in s
what is the minimum edit distance for all substrings started at position i
.
For example:
t = "ab"
s = "sdabcb"
And I need to get something like:
{2,1,0,2,2}
Explanation:
1st position:
distance("ab", "sd") = 4 ( 2*subst )
distance("ab", "sda") = 3( 2*delete + insert )
distance("ab", "sdab") = 2 ( 2 * delete)
distance("ab", "sdabc") = 3 ( 3 * delete)
distance("ab", "sdabcb") = 4 ( 4 * delete)
So, minimum is 2
2nd position:
distance("ab", "da") = 2 (delete + insert)
distance("ab", "dab") = 1 (delete)
distance("ab", "dabc") = 2 (2*delete)
....
So, minimum is 1
3th position:
distance("ab", "ab") = 0
...
minimum is 0
and so on.
I can use brute force algorithm to solve this task, of course. But is there faster algorithm?
Thanks for help.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
在给定字符串中查找子字符串非常容易。
您采用正常的 Levenshtein 算法并稍作修改。
第一:
而不是用 0,1,2,3,4,5,... 填充矩阵的第一行
你完全用零填充它。 (绿色矩形)
第二:
然后运行算法。
第三:
您不是返回最后一行的最后一个单元格,而是搜索最后一行中的最小值并返回它。 (红色矩形)
示例:
针:“aba”,干草堆:“c abba c” -->结果 = 1(转换 abba -> aba)
我测试了它并且它有效。
这比您在问题中那样逐个字符地遍历字符串的建议要快得多。您只需创建一次矩阵。
To find substrings in a given string is very easy.
You take the normal Levenshtein algorithm and modify it slightly.
FIRST:
Instead of filling the first row of the matrix with 0,1,2,3,4,5,...
you fill it entirely with zeros. (green rectangle)
SECOND:
Then you run the algorithm.
THIRD:
Instead of returning the last cell of the last row you search for the smallest value in the last row and return it. (red rectangle)
Example:
needle: "aba", haystack: "c abba c" --> result = 1 (converting abba -> aba)
I tested it and it works.
This is much faster than your suggestion of stepping character by character through the string as you do in your question. You only create the matrix once.
Wagner-Fischer 算法“免费”为您提供所有前缀的答案。
http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm
Wagner-Fischer 矩阵的最后一行包含从
s
的每个前缀到t
的编辑距离。因此,作为解决问题的第一步,对于每个
i
,运行 Wagner-Fischer 并选择最后一行中的最小元素。我很想知道是否有其他人知道(或能找到)更好的方法。
The Wagner-Fischer algorithm gives you the answer for all prefixes "for free".
http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm
The last row of the Wagner-Fischer matrix contains the edit distance from each prefix of
s
tot
.So as a first crack at your problem, for each
i
, run Wagner-Fischer and select the smallest element in the last row.I will be curious to see if anyone else knows (or can find) a better approach.