Searching for multiple patterns in a string with Rabin-Karp

According to the Wikipedia entry on the Rabin-Karp string matching algorithm, it can be used to look for several different patterns in a string at the same time while still maintaining linear complexity. It is clear that this is easily done when all the patterns are of the same length, but I still don't get how we can preserve O(n) complexity when searching for patterns of differing lengths simultaneously. Can someone please shed some light on this?

Edit (December 2011):

The Wikipedia article has since been updated and no longer claims to match multiple patterns of differing length in O(n).

Comments (2)

山色无中 2024-08-10 05:52:38

I'm not sure if this is the correct answer, but anyway:

While constructing the hash value, we can check whether the current hash value matches anything in the set of pattern hashes. The hash function is usually implemented as a loop, and inside that loop we can insert our quick lookup.

Of course, we must pick m to be the maximum string length from the set of strings.

Update: From Wikipedia,

[...]
for i from 1 to n-m+1
    if hs ∈ hsubs
        if s[i..i+m-1] = a substring with hash hs
            return i
    hs := hash(s[i+1..i+m]) // <---- calculating current hash
[...]

We calculate the current hash in m steps. At each step there is an intermediate hash value that we can look up in the set of hashes (an O(1) operation). All hashes have the same size, i.e. 32 bits.
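
For concreteness, here is a minimal Python sketch of that equal-length case, following the Wikipedia pseudocode above. The polynomial hash, the BASE/MOD constants and the function names are illustrative assumptions of mine, not part of the original answer.

# Rabin-Karp search for several patterns that all have the same length m.
# BASE and MOD are illustrative choices for a polynomial rolling hash.
BASE = 256
MOD = 1_000_000_007

def poly_hash(s):
    """Polynomial hash of the whole string s."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def find_any(text, patterns):
    """Return (index, pattern) for the first match of any pattern, or None."""
    m = len(patterns[0])              # all patterns share this length
    n = len(text)
    if n < m:
        return None
    hsubs = {}                        # hash -> patterns with that hash
    for p in patterns:
        hsubs.setdefault(poly_hash(p), []).append(p)

    high = pow(BASE, m - 1, MOD)      # weight of the character leaving the window
    hs = poly_hash(text[:m])          # hash of the first window
    for i in range(n - m + 1):
        if hs in hsubs:               # O(1) dictionary lookup
            window = text[i:i + m]
            if window in hsubs[hs]:   # verify to rule out hash collisions
                return i, window
        if i + m < n:                 # roll the hash to the next window
            hs = ((hs - ord(text[i]) * high) * BASE + ord(text[i + m])) % MOD
    return None

For example, find_any("abracadabra", ["cad", "bra"]) returns (1, "bra").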

Update 2: amortized (average) O(n) time complexity?

Above I said that m must be the maximum string length. It turns out that we can exploit the opposite: with the rolling hash used for shifting the substring search and a fixed m, we can achieve O(n) complexity.

If we have variable-length strings we can set m to the minimum string length. Additionally, in the set of hashes we don't associate a hash with the whole string but with its first m characters.
Now, while searching the text, we check whether the current hash is in the hash set and, if so, examine the associated strings for a full match.

This technique increases the number of false alarms, but on average it still has O(n) time complexity.
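
A rough Python sketch of this variable-length variant, assuming the same kind of polynomial rolling hash; bucketing the patterns by the hash of their first m characters is my reading of the answer, and the constants and names are illustrative.

# Variable-length patterns: m = length of the SHORTEST pattern.
# Each pattern is filed under the hash of its first m characters;
# on a hash hit we verify the full pattern (this is where the extra
# false alarms come from).  BASE/MOD are illustrative choices.
BASE = 256
MOD = 1_000_000_007

def poly_hash(s):
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def find_any_var(text, patterns):
    """Return (index, pattern) for the first match of any pattern, or None."""
    m = min(len(p) for p in patterns)
    n = len(text)
    if n < m:
        return None
    buckets = {}                        # prefix hash -> patterns sharing that prefix hash
    for p in patterns:
        buckets.setdefault(poly_hash(p[:m]), []).append(p)

    high = pow(BASE, m - 1, MOD)
    hs = poly_hash(text[:m])
    for i in range(n - m + 1):
        if hs in buckets:               # candidate: the first m characters may match
            for p in buckets[hs]:
                if text.startswith(p, i):   # verify the whole pattern
                    return i, p
        if i + m < n:
            hs = ((hs - ord(text[i]) * high) * BASE + ord(text[i + m])) % MOD
    return None

For instance, with patterns ["sea", "shells"] and text "she sells seashells", the window "she" at position 0 is a false alarm for "shells" that the full-string check rejects; the first real match returned is (10, "sea").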

伤感在游骋 2024-08-10 05:52:38

It's because the hash values of the substrings are related mathematically. Computing the hash H(S, j) (the hash of the m characters starting at the jth position of string S) takes O(m) time for a window of length m. But once you have that, computing H(S, j+1) can be done in constant time, because H(S, j+1) can be expressed as a function of H(S, j).

O(m) for the first window plus O(1) for each subsequent shift gives O(n + m) = O(n) overall, i.e. linear time.
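
As an illustration of that relation, here is a small Python sketch using a polynomial hash; BASE, MOD and the function name are my own assumptions, not taken from the answer.

# H(S, j+1) from H(S, j) in O(1), for windows of length m, with a polynomial hash
# H = s[j]*B^(m-1) + s[j+1]*B^(m-2) + ... + s[j+m-1].
BASE, MOD = 256, 1_000_000_007

def window_hashes(s, m):
    """Yield (j, hash of s[j:j+m]) for every window; each step after the first is O(1)."""
    high = pow(BASE, m - 1, MOD)     # weight of the character that leaves the window
    h = 0
    for ch in s[:m]:                 # O(m): hash of the first window
        h = (h * BASE + ord(ch)) % MOD
    yield 0, h
    for j in range(1, len(s) - m + 1):
        # drop s[j-1], shift everything up by one power of BASE, append s[j+m-1]
        h = ((h - ord(s[j - 1]) * high) * BASE + ord(s[j + m - 1])) % MOD
        yield j, h

Each of the n-m shifts only touches two characters, which is where the constant-time step (and hence the overall linear time) comes from.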

Here's a link where this is described in more detail (see e.g. the section "What makes Rabin-Karp fast?")
