如何修改 Damerau-Levenshtein 算法,使其还包括较大子字符串的开始索引和结束索引?
这是我的代码:
#http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
# used for fuzzy matching of two strings
# for indexing, seq2 must be the parent string
def dameraulevenshtein(seq1, seq2)
oneago = nil
min = 100000000000 #index
max = 0 #index
thisrow = (1..seq2.size).to_a + [0]
seq1.size.times do |x|
twoago, oneago, thisrow = oneago, thisrow, [0] * seq2.size + [x + 1]
seq2.size.times do |y|
delcost = oneago[y] + 1
addcost = thisrow[y - 1] + 1
subcost = oneago[y - 1] + ((seq1[x] != seq2[y]) ? 1 : 0)
thisrow[y] = [delcost, addcost, subcost].min
if (x > 0 and y > 0 and seq1[x] == seq2[y-1] and seq1[x-1] == seq2[y] and seq1[x] != seq2[y])
thisrow[y] = [thisrow[y], twoago[y-2] + 1].min
end
end
end
return thisrow[seq2.size - 1], min, max
end
必须有某种方法来获取子字符串 seq1 和父字符串 seq2 的起始和结束索引,对吧?
即使在阅读了有关该算法的维基文章之后,我也不完全确定该算法是如何工作的。我的意思是,我理解最高级别的解释,因为它发现插入、删除和转置差异(第二个循环中的行)..但除此之外。我有点失落。
这是我希望能够用这个做的一个例子(^):
substring = "hello there"
search_string = "uh,\n\thello\n\t there"
索引应该是:
start: 5
end: 18 (last char of string)
理想情况下,search_string 永远不会被修改。但是,我想我可以取出所有空白字符(因为只有.. 3? \n \r 和 \t)存储每个空白字符的索引,获取子字符串的索引,然后重新 -添加空白字符,确保补偿子字符串的索引,因为我首先用最初存在的空白字符偏移它们。 -- 但如果这一切都可以用相同的方法完成,那就太棒了,因为算法已经是 O(n^2).. =(
在某些时候,我只想允许空格字符分割向上子字符串(s1)..但一次只做一件事
Here is my code:
#http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
# used for fuzzy matching of two strings
# for indexing, seq2 must be the parent string
def dameraulevenshtein(seq1, seq2)
oneago = nil
min = 100000000000 #index
max = 0 #index
thisrow = (1..seq2.size).to_a + [0]
seq1.size.times do |x|
twoago, oneago, thisrow = oneago, thisrow, [0] * seq2.size + [x + 1]
seq2.size.times do |y|
delcost = oneago[y] + 1
addcost = thisrow[y - 1] + 1
subcost = oneago[y - 1] + ((seq1[x] != seq2[y]) ? 1 : 0)
thisrow[y] = [delcost, addcost, subcost].min
if (x > 0 and y > 0 and seq1[x] == seq2[y-1] and seq1[x-1] == seq2[y] and seq1[x] != seq2[y])
thisrow[y] = [thisrow[y], twoago[y-2] + 1].min
end
end
end
return thisrow[seq2.size - 1], min, max
end
there has to be someway to get the starting and ending index of substring, seq1, withing parent string, seq2, right?
I'm not entirely sure how this algorithm works, even after reading the wiki article on it. I mean, I understand the highest level explanation, as it finds the insertion, deletion, and transposition difference (the lines in the second loop).. but beyond that. I'm a bit lost.
Here is an example of something that I wan to be able to do with this (^):
substring = "hello there"
search_string = "uh,\n\thello\n\t there"
the indexes should be:
start: 5
end: 18 (last char of string)
Ideally, the search_string will never be modified. But, I guess I could take out all the white space characters (since there are only.. 3? \n \r and \t) store the indexes of each white space character, get the indexes of my substring, and then re-add in the white space characters, making sure to compensate the substring's indexes as I offset them with the white space characters that were originally in there in the first place. -- but if this could all be done in the same method, that would be amazing, as the algorithm is already O(n^2).. =(
At some point, I'd like to only allow white space characters to split up the substring (s1).. but one thing at a time
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我认为这个算法不是你想要做的事情的正确选择。该算法只是根据将一个字符串转换为另一个字符串所需的修改次数来计算两个字符串之间的距离。如果为了简洁起见,我们将函数重命名为 dlmatch 并且仅返回距离,那么我们就得到了:
这意味着您可以通过 7 个步骤将一个字符串转换为另一个字符串(实际上是从第二个字符串中删除七个字符)。问题是 7 个步骤是一个相当大的差异:
这实际上意味着“hellothere”和“pandahere”比第一个示例更接近匹配。
如果您想要做的是“找到一个最匹配的子字符串”,我认为当您将第一个字符串提供给第二个字符串的一系列子字符串,然后选择时,您会陷入 O(n^3) 算法为您提供最接近匹配的子字符串。
或者,您最好尝试对搜索字符串进行预处理,然后与子字符串进行正则表达式匹配。例如,您可以去掉所有特殊字符,然后构建一个正则表达式来查找子字符串中不区分大小写且之间可以有任意数量空格的单词。
I don't think this algorithm is the right choice for what you want to do. The algorithm is simply calculating the distance between two strings in terms of the number of modifications you need to make to turn one string into another. If we rename your function to dlmatch for brevity and only return the distance, then we have:
meaning that you can convert one string into the other in 7 steps (effectively by removing seven characters from the second). The problem is that 7 steps is a pretty big difference:
This would actually imply that "hello there" and "panda here" are closer matches than the first example.
If what you are trying to do is "find a substring that mostly matches", I think you are stuck with an O(n^3) algorithm as you feed the first string to a series of substrings of the second string, and then selecting the substring that provides you the closest match.
Alternatively, you may be better off trying to do pre-processing on the search string and then doing regexp matching with the substring. For example, you could strip off all special characters and then build a regexp that looks for words in the substring that are case insensitive and can have any amount of whitespace between them.