使用回溯近似字符串匹配

发布于 2024-12-07 03:46:53 字数 217 浏览 2 评论 0原文

我想使用回溯来搜索长字符串中的所有子字符串,允许可变长度匹配 - 即允许最大给定数量的不匹配、插入和删除的匹配。我无法找到任何有用的例子。我发现的最接近的是这篇论文这里,但那是非常复杂。有人吗?

干杯,

马丁

I would like to use backtracking to search for all substrings in a long string allowing for variable length matches - that is matches allowing for a maximum given number of mismatches, insertions, and deletions. I have not been able to locate any useful examples. The closest I have found is this paper here, but that is terribly complex. Anyone?

Cheers,

Martin

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

孤独岁月 2024-12-14 03:46:54

我将从 Levenshtein 距离 算法开始,这是通过不匹配检查字符串相似性时的标准方法,插入和删除。

由于该算法使用自下而上的动态编程,因此您可能能够找到所有子字符串,而无需对每个潜在子串执行算法。

I would start with Levenshtein's distance algorithm, which is the standard approach when checking for string similarities via mismatch, insertion and deletion.

Since the algorithm uses bottom up dynamic programming, you'll probably be able to find all substrings without having to execute the algorithm for each potential substring.

骑趴 2024-12-14 03:46:54

我所知道的最好的算法是 A Fast Bit-基于动态规划的近似字符串匹配向量算法,作者:Gene Myers。给定要搜索的长度为 n 的文本、要搜索的长度为 m 的模式字符串以及最大不匹配/插入/删除次数 k,该算法需要时间 O(mn/w),其中 w 是计算机的字大小 (32或 64)。如果您对字符串算法了解很多,那么存在一种与 k 无关的时间算法实际上是相当令人难以置信的——在很长一段时间内,这似乎是一个不可能的目标。

我不知道上述算法的现有实现。如果您想要一个工具,agrep可能正是您所需要的。它使用了一种较早的算法,需要时间 O(mnk/w),但对于低 k 来说它已经足够快了——在最坏的情况下比回溯搜索领先几英里。

agrep 基于Shift-Or(或“Bitap”)算法,这是一种非常聪明的动态编程算法,它设法将其状态表示为整数中的位,并使用二进制加法来完成更新状态的大部分工作,这就是加速算法的原因32 倍或64 超过一个更典型的实现。 :) Myers 的算法也使用这个想法来得到它的 1/w 速度因子。

The nicest algorithm I'm aware of for this is A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming by Gene Myers. Given a text to search of length n, a pattern string to search for of length m and a maximum number of mismatches/insertions/deletions k, this algorithm takes time O(mn/w), where w is your computer's word size (32 or 64). If you know much about algorithms on strings, it's actually pretty incredible that an algorithm exists that takes time independent of k -- for a long time, this seemed an impossible goal.

I'm not aware of an existing implementation of the above algorithm. If you want a tool, agrep may be just what you need. It uses an earlier algorithm that takes time O(mnk/w), but it's fast enough for low k -- miles ahead of a backtracking search in the worst case.

agrep is based on the Shift-Or (or "Bitap") algorithm, which is a very clever dynamic programming algorithm that manages to represent its state as bits in an integer and get binary addition to do most of the work of updating the state, which is what speeds up the algorithm by a factor of 32 or 64 over a more typical implementation. :) Myers's algorithm also uses this idea to get its 1/w speed factor.

杀手六號 2024-12-14 03:46:53

算法

下面的函数 ff() 使用递归(即回溯)来解决您的问题。基本思想是,在对 f() 的任何调用开始时,我们尝试将原始“needle”字符串的后缀 t 与后缀 匹配。 >s 的“haystack”字符串,同时只允许每种类型进行一定数量的编辑操作。

// ss is the start of the haystack, used only for reporting the match endpoints.
void f(char* ss, char* s, char* t, int mm, int ins, int del) {
    while (*s && *s == *t) ++s, ++t;    // OK to always match longest segment
    if (!*t) printf("%d\n", s - ss);    // Matched; print endpoint of match
    if (mm && *s && *t) f(ss, s + 1, t + 1, mm - 1, ins, del);
    if (ins && *s) f(ss, s + 1, t, mm, ins - 1, del);
    if (del && *t) f(ss, s, t + 1, mm, ins, del - 1);
}

// Find all occurrences of t starting at any position in s, with at most
// mm mismatches, ins insertions and del deletions.
void ff(char* s, char* t, int mm, int ins, int del) {
    for (char* ss = s; *s; ++s) {
//      printf("Starting from offset %d...\n", s - ss);
        f(ss, s, t, mm, ins, del);
    }
}

调用示例:

ff("xxabcydef", "abcdefg", 1, 1, 1);

此输出:

9
9

因为有两种方法可以在“xxabcydef”中查找“abcdefg”,每种更改最多有 1 个,并且这两种方法都以位置 9 结束:

Haystack: xxabcydef-
Needle:     abc-defg

其中有 1 个插入(y)和 1 次删除(g),并且

Haystack: xxabcyde-f
Needle:     abc-defg

有 1 次插入(y)、1 次删除(f) ,以及 1 次替换gf

支配关系

为什么在第 3 行使用 while 循环贪婪地匹配两个字符串开头的尽可能多的字符实际上是安全的,这一点可能并不明显。事实上,这可能会减少将特定结束位置报告为匹配的次数,但它永远不会导致完全忘记结束位置 - 因为我们通常感兴趣只是无论是否存在在干草堆的给定位置结束的匹配,并且如果没有这个 while 循环,算法将总是花费针大小的指数时间,这似乎双赢。

由于支配关系,它可以保证工作。为了看到这一点,假设相反——它实际上是不安全的(即错过了一些匹配)。然后会出现一些匹配,其中两个字符串中的相同字符的初始段彼此不对齐,例如:

Haystack: abbbbc
Needle:   a-b-bc

但是,任何此类匹配都可以转换为具有相同每种类型操作数的另一个匹配,并且结束在同一位置,通过将间隙后面的最左边的字符分流到间隙的左侧:

Haystack: abbbbc
Needle:   ab--bc

如果您重复执行此操作,直到不再可能在不需要替换的情况下分流字符,您将获得一个匹配项,其中最大的初始段两个字符串中的相同字符都对齐other:

Haystack: abbbbc
Needle:   abb--c

我的算法会找到所有这样的匹配,因此它不会忽略任何匹配位置。

指数时间

与任何回溯程序一样,此函数将在某些输入上表现出指数减慢。当然,在您碰巧使用的输入上,这种情况可能不会发生,并且它的计算速度比 DP 算法的特定实现更快。

Algorithm

The function ff() below uses recursion (i.e. backtracking) to solve your problem. The basic idea is that at the start of any call to f(), we are trying to match a suffix t of the original "needle" string to a suffix s of the "haystack" string, while allowing only a certain number of each type of edit operation.

// ss is the start of the haystack, used only for reporting the match endpoints.
void f(char* ss, char* s, char* t, int mm, int ins, int del) {
    while (*s && *s == *t) ++s, ++t;    // OK to always match longest segment
    if (!*t) printf("%d\n", s - ss);    // Matched; print endpoint of match
    if (mm && *s && *t) f(ss, s + 1, t + 1, mm - 1, ins, del);
    if (ins && *s) f(ss, s + 1, t, mm, ins - 1, del);
    if (del && *t) f(ss, s, t + 1, mm, ins, del - 1);
}

// Find all occurrences of t starting at any position in s, with at most
// mm mismatches, ins insertions and del deletions.
void ff(char* s, char* t, int mm, int ins, int del) {
    for (char* ss = s; *s; ++s) {
//      printf("Starting from offset %d...\n", s - ss);
        f(ss, s, t, mm, ins, del);
    }
}

Example call:

ff("xxabcydef", "abcdefg", 1, 1, 1);

This outputs:

9
9

because there are two ways to find "abcdefg" in "xxabcydef" with at most 1 of each kind of change, and both of these ways end at position 9:

Haystack: xxabcydef-
Needle:     abc-defg

which has 1 insertion (of y) and 1 deletion (of g), and

Haystack: xxabcyde-f
Needle:     abc-defg

which has 1 insertion (of y), 1 deletion (of f), and 1 substitution of g to f.

Dominance Relation

It may not be obvious why it's actually safe to use the while loop on line 3 to greedily match as many characters as possible at the start of the two strings. In fact this may reduce the number of times that a particular end position will be reported as a match, but it will never cause an end position to be forgotten completely -- and since we're usually interested in just whether or not there is a match ending at a given position of the haystack, and without this while loop the algorithm would always take time exponential in the needle size, this seems a win-win.

It is guaranteed to work because of a dominance relation. To see this, suppose the opposite -- that it is in fact unsafe (i.e. misses some matches). Then there would be some match in which an initial segment of equal characters from both strings are not aligned to each other, for example:

Haystack: abbbbc
Needle:   a-b-bc

However, any such match can be transformed into another match having the same number of operations of each type, and ending at the same position, by shunting the leftmost character following a gap to the left of the gap:

Haystack: abbbbc
Needle:   ab--bc

If you do this repeatedly until it's no longer possible to shunt characters without requiring a substitution, you will get a match in which the largest initial segment of equal characters from both strings are aligned to each other:

Haystack: abbbbc
Needle:   abb--c

My algorithm will find all such matches, so it follows that no match position will be overlooked by it.

Exponential Time

Like any backtracking program, this function will exhibit exponential slowdowns on certain inputs. Of course, it may be that on the inputs you happen to use, this doesn't occur, and it works out faster than particular implementations of DP algorithms.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文