Searching for multiple patterns in a string with Rabin-Karp

According to the Wikipedia entry on the Rabin-Karp string matching algorithm, it can be used to look for several different patterns in a string at the same time while still maintaining linear complexity. It is clear that this is easily done when all the patterns are of the same length, but I still don't get how we can preserve O(n) complexity when searching for patterns of differing lengths simultaneously. Can someone please shed some light on this?

Edit (December 2011):

The Wikipedia article has since been updated and no longer claims to match multiple patterns of differing length in O(n).

Comments (2)

山色无中 2024-08-10 05:52:38

I'm not sure if this is the correct answer, but anyway:

While constructing the hash value, we can check whether the current hash value matches anything in the set of pattern hashes. The hash function is usually implemented as a loop, and inside that loop we can insert our quick lookup.

Of course, we must pick m to be the maximum string length from the set of strings.

Update: From Wikipedia,

[...]
for i from 1 to n-m+1
    if hs ∈ hsubs
        if s[i..i+m-1] = a substring with hash hs
            return i
    hs := hash(s[i+1..i+m]) // <---- calculating current hash
[...]

We calculate the current hash in m steps. At each step there is an intermediate hash value that we can look up in the set of hashes (an O(1) operation). All hashes have the same size, i.e. 32 bits.
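
For concreteness, here is a minimal Python sketch of that equal-length case, following the Wikipedia pseudocode above. The polynomial hash, the BASE/MOD constants and the function names are illustrative assumptions of mine, not part of the original answer.

# Rabin-Karp search for several patterns that all have the same length m.
# BASE and MOD are illustrative choices for a polynomial rolling hash.
BASE = 256
MOD = 1_000_000_007

def poly_hash(s):
    """Polynomial hash of the whole string s."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def find_any(text, patterns):
    """Return (index, pattern) for the first match of any pattern, or None."""
    m = len(patterns[0])              # all patterns share this length
    n = len(text)
    if n < m:
        return None
    hsubs = {}                        # hash -> patterns with that hash
    for p in patterns:
        hsubs.setdefault(poly_hash(p), []).append(p)

    high = pow(BASE, m - 1, MOD)      # weight of the character leaving the window
    hs = poly_hash(text[:m])          # hash of the first window
    for i in range(n - m + 1):
        if hs in hsubs:               # O(1) dictionary lookup
            window = text[i:i + m]
            if window in hsubs[hs]:   # verify to rule out hash collisions
                return i, window
        if i + m < n:                 # roll the hash to the next window
            hs = ((hs - ord(text[i]) * high) * BASE + ord(text[i + m])) % MOD
    return None

For example, find_any("abracadabra", ["cad", "bra"]) returns (1, "bra").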

Update 2: amortized (average) O(n) time complexity?

Above I said that m must be the maximum string length. It turns out that we can exploit the opposite: with the rolling hash used for shifting the substring search and a fixed m, we can achieve O(n) complexity.

If we have variable-length strings we can set m to the minimum string length. Additionally, in the set of hashes we don't associate a hash with the whole string but with its first m characters.
Now, while searching the text, we check whether the current hash is in the hash set and, if so, examine the associated strings for a full match.

This technique increases the number of false alarms, but on average it still has O(n) time complexity.
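
A rough Python sketch of this variable-length variant, assuming the same kind of polynomial rolling hash; bucketing the patterns by the hash of their first m characters is my reading of the answer, and the constants and names are illustrative.

# Variable-length patterns: m = length of the SHORTEST pattern.
# Each pattern is filed under the hash of its first m characters;
# on a hash hit we verify the full pattern (this is where the extra
# false alarms come from).  BASE/MOD are illustrative choices.
BASE = 256
MOD = 1_000_000_007

def poly_hash(s):
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def find_any_var(text, patterns):
    """Return (index, pattern) for the first match of any pattern, or None."""
    m = min(len(p) for p in patterns)
    n = len(text)
    if n < m:
        return None
    buckets = {}                        # prefix hash -> patterns sharing that prefix hash
    for p in patterns:
        buckets.setdefault(poly_hash(p[:m]), []).append(p)

    high = pow(BASE, m - 1, MOD)
    hs = poly_hash(text[:m])
    for i in range(n - m + 1):
        if hs in buckets:               # candidate: the first m characters may match
            for p in buckets[hs]:
                if text.startswith(p, i):   # verify the whole pattern
                    return i, p
        if i + m < n:
            hs = ((hs - ord(text[i]) * high) * BASE + ord(text[i + m])) % MOD
    return None

For instance, with patterns ["sea", "shells"] and text "she sells seashells", the window "she" at position 0 is a false alarm for "shells" that the full-string check rejects; the first real match returned is (10, "sea").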

伤感在游骋 2024-08-10 05:52:38

It's because the hash values of the substrings are related mathematically. Computing the hash H(S, j) (the hash of the m characters starting at the jth position of string S) takes O(m) time for a window of length m. But once you have that, computing H(S, j+1) can be done in constant time, because H(S, j+1) can be expressed as a function of H(S, j).

O(m) for the first window plus O(1) for each subsequent shift gives O(n + m) = O(n) overall, i.e. linear time.
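
As an illustration of that relation, here is a small Python sketch using a polynomial hash; BASE, MOD and the function name are my own assumptions, not taken from the answer.

# H(S, j+1) from H(S, j) in O(1), for windows of length m, with a polynomial hash
# H = s[j]*B^(m-1) + s[j+1]*B^(m-2) + ... + s[j+m-1].
BASE, MOD = 256, 1_000_000_007

def window_hashes(s, m):
    """Yield (j, hash of s[j:j+m]) for every window; each step after the first is O(1)."""
    high = pow(BASE, m - 1, MOD)     # weight of the character that leaves the window
    h = 0
    for ch in s[:m]:                 # O(m): hash of the first window
        h = (h * BASE + ord(ch)) % MOD
    yield 0, h
    for j in range(1, len(s) - m + 1):
        # drop s[j-1], shift everything up by one power of BASE, append s[j+m-1]
        h = ((h - ord(s[j - 1]) * high) * BASE + ord(s[j + m - 1])) % MOD
        yield j, h

Each of the n-m shifts only touches two characters, which is where the constant-time step (and hence the overall linear time) comes from.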

Here's a link where this is described in more detail (see e.g. the section "What makes Rabin-Karp fast?")
