当前位置：文江博客话题详情

Boyer Moore算法理解和例子？

发布于 2024-11-13 08:09:25 字数 281 浏览 3 评论 0原文

我在理解博耶摩尔字符串搜索算法方面面临问题。

我正在关注以下文件。链接

我无法弄清楚到底什么是真正的含义这里是 delta1 和 delta2，以及他们如何应用它来查找字符串搜索算法。语言看起来有点模糊……

如果有人能帮助我理解这一点，那将会非常有帮助。

或者，如果您知道任何其他易于理解的链接或文档，请分享。

提前致谢。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

半窗疏影 2024-11-20 08:09:25

Boyer-Moore 背后的见解是，如果您开始在字符串中以模式中的最后一个字符开始搜索模式，那么当遇到不匹配时，您可以向前跳转多个字符。

假设我们的模式 p 是字符序列 p1、p2、...、pn，我们是搜索字符串 s，当前与 p 对齐，以便 pn 位于 s< 中的索引 i 处/代码>。

例如：

s = WHICH FINALLY HALTS.  AT THAT POINT...
p = AT THAT
i =       ^

BM 论文做出了以下观察：

（1）如果我们尝试匹配不在 p 中的字符，那么我们可以向前跳转 n 个字符：

'F' 是不在 p 中，因此我们前进 n 个字符：

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =        AT THAT
i =              ^

(2) 如果我们尝试匹配最后一个位置为 k 的字符（从 < 末尾算起） code>p 那么我们可以向前跳转 k 个字符：

' 在 p 中的最后一个位置是距离末尾 4 个字符，因此我们前进 4 个字符：

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =            AT THAT
i =                  ^

现在我们从i开始向后扫描，直到我们成功或者遇到不匹配。
(3a) 如果从 p 开头起发生了 k 个字符，并且不匹配的字符不在 p 中，那么我们可以前进（在至少）k 个字符。

'L' 不在 p 中，并且与 p6 发生不匹配，因此我们可以前进（至少）6 个字符：

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                  AT THAT
i =                        ^

但是，我们实际上可以做得更好。
(3b) 因为我们知道在旧的 i 中我们已经匹配了一些字符（在本例中为 1）。如果匹配的字符与p的开头不匹配，那么我们实际上可以向前跳一点（这个额外的距离在论文中称为“delta2”）：

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                   AT THAT
i =                         ^

此时，观察（2 ）再次申请，给予

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                       AT THAT
i =                             ^

和宾果游戏！我们完成了。

The insight behind Boyer-Moore is that if you start searching for a pattern in a string starting with the last character in the pattern, you can jump your search forward multiple characters when you hit a mismatch.

Let's say our pattern p is the sequence of characters p1, p2, ..., pn and we are searching a string s, currently with p aligned so that pn is at index i in s.

E.g.:

s = WHICH FINALLY HALTS.  AT THAT POINT...
p = AT THAT
i =       ^

The B-M paper makes the following observations:

(1) if we try matching a character that is not in p then we can jump forward n characters:

'F' is not in p, hence we advance n characters:

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =        AT THAT
i =              ^

(2) if we try matching a character whose last position is k from the end of p then we can jump forward k characters:

' 's last position in p is 4 from the end, hence we advance 4 characters:

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =            AT THAT
i =                  ^

Now we scan backwards from i until we either succeed or we hit a mismatch.
(3a) if the mismatch occurs k characters from the start of p and the mismatched character is not in p, then we can advance (at least) k characters.

'L' is not in p and the mismatch occurred against p6, hence we can advance (at least) 6 characters:

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                  AT THAT
i =                        ^

However, we can actually do better than this.
(3b) since we know that at the old i we'd already matched some characters (1 in this case). If the matched characters don't match the start of p, then we can actually jump forward a little more (this extra distance is called 'delta2' in the paper):

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                   AT THAT
i =                         ^

At this point, observation (2) applies again, giving

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                       AT THAT
i =                             ^

and bingo! We're done.

回复收藏 0 原文

幽蝶幻影 2024-11-20 08:09:25

该算法基于一个简单的原理。假设我正在尝试匹配长度为 m 的子字符串。我将首先查看索引 m 处的字符。如果该字符不在我的字符串中，我知道我想要的子字符串不能以索引 1, 2, ... , m 处的字符开头。

如果该字符在我的字符串中，我将假设它位于字符串中的最后一个位置。然后我将跳回来并开始尝试从可能的起始位置匹配我的字符串。这条信息是我的第一个表。

一旦我从子字符串的开头开始匹配，当我发现不匹配时，我就不能从头开始。我可能会在不同的时间点开始部分完成一场比赛。例如，如果我尝试匹配 ananand 中的 anand 成功匹配 anan，请意识到以下 a不是 d，但我刚刚匹配了 an，因此我应该跳回尝试匹配子字符串中的第三个字符。 “如果我在匹配 x 个字符后失败，我可能会在匹配的第 y 个字符上”信息存储在第二个表中。

请注意，当我无法匹配时，第二个表会根据我刚刚匹配的内容知道我可能在匹配中进行了多远。第一个表根据我刚刚看到但未能匹配的字符知道我可能落后多远。您想使用这两条信息中较为悲观的一条。

考虑到这一点，算法的工作原理如下：

start at beginning of string
start at beginning of match
while not at the end of the string:
    if match_position is 0:
        Jump ahead m characters
        Look at character, jump back based on table 1
        If match the first character:
            advance match position
        advance string position
    else if I match:
        if I reached the end of the match:
           FOUND MATCH - return
        else:
           advance string position and match position.
    else:
        pos1 = table1[ character I failed to match ]
        pos2 = table2[ how far into the match I am ]
        if pos1 < pos2:
            jump back pos1 in string
            set match position at beginning
        else:
            set match position to pos2
FAILED TO MATCH

The algorithm is based on a simple principle. Suppose that I'm trying to match a substring of length m. I'm going to first look at character at index m. If that character is not in my string, I know that the substring I want can't start in characters at indices 1, 2, ... , m.

If that character is in my string, I'll assume that it is at the last place in my string that it can be. I'll then jump back and start trying to match my string from that possible starting place. This piece of information is my first table.

Once I start matching from the beginning of the substring, when I find a mismatch, I can't just start from scratch. I could be partially through a match starting at a different point. For instance if I'm trying to match anand in ananand successfully match, anan, realize that the following a is not a d, but I've just matched an, and so I should jump back to trying to match my third character in my substring. This, "If I fail after matching x characters, I could be on the y'th character of a match" information is stored in the second table.

Note that when I fail to match the second table knows how far along in a match I might be based on what I just matched. The first table knows how far back I might be based on the character that I just saw which I failed to match. You want to use the more pessimistic of those two pieces of information.

With this in mind the algorithm works like this:

start at beginning of string
start at beginning of match
while not at the end of the string:
    if match_position is 0:
        Jump ahead m characters
        Look at character, jump back based on table 1
        If match the first character:
            advance match position
        advance string position
    else if I match:
        if I reached the end of the match:
           FOUND MATCH - return
        else:
           advance string position and match position.
    else:
        pos1 = table1[ character I failed to match ]
        pos2 = table2[ how far into the match I am ]
        if pos1 < pos2:
            jump back pos1 in string
            set match position at beginning
        else:
            set match position to pos2
FAILED TO MATCH

回复收藏 0 原文

~没有更多了~