当前位置：文江博客话题详情

线性时间算法用于在字符串中查找大多数频繁的m字母substring

发布于 2025-01-21 23:22:16 字数 120 浏览 2 评论 0原文

假设我们有一个字母字符串，我们正在寻找大多数重复的m字母substring（1 =＆lt; m =＆lt; n）。我只是在寻找一种在线性时间解决此问题的算法。我已经到达后缀树。但是如何通过后缀树解决它？

多谢。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

儭儭莪哋寶赑 2025-01-28 23:22:16

想法

您还可以使用哈希功能解决它。

我们可以将字符串转换为基本b数字，其中b是质量数字。

例如，如果字符串仅由小写字母组成（26个字符az），那么我们可以选择b等于29。

然后，我们将字符串字符映射到相应的数字。例如：

a -> 1
b -> 2
c -> 3
...
z -> 26

字符串ABC将等于29^2*1 + 29^1*2 + 29^0*3 = 899在基本29中。

我们应该映射a - ＆gt; 1，但不是a - ＆gt; 0由于aaa和aa的哈希值在基本b中是相等的，这不应该是。

现在，我们可以比较基本b中的哈希值，而不是比较两个字符串。如果它们的哈希值相等，那么我们可以说它们是相等的。

由于哈希值可能很大，因此您可以使用它的模块一个大质量数字，例如mod 1e9+7。在这种情况下，两个不同字符串的实体性具有相同的哈希值。

算法

可以将算法描述为Bellow：

Let n-letter string be S
Let hash(s) be function to get hash value of string s
For each m-letter-substring of S, call it s
   Increase the number of occurrences of hash(s), let call its o(hash(s))
Result will be the m-letter-substring s with the maximum o(hash(s))

要计算hash（s），首先，我们构建数组h其中：

H[i] = (b^(i-1)*S[1] + b^(i-2)*S[2] + b^(i-3)*S[3] + ... + b^0*S[i]) % mod

shere s [i] is映射的字符串字符i-th的字符数量。

要计算b^x，我们可以计算数组powb其中：

powb[0] = 1; powb[i] = (powb[i - 1] * b) % mod

然后，对于子字符串s [l。字符串s的.r]

hash(s[l..r]) = (H[r] - H[l-1]*b^(r-l+1)) % mod

正如我们所看到的， hash（s）可能为负，在这种情况下，我们应该将mod hash添加（s）（hash（s） += mod）。

复杂

O(N) to calculate H, powb
O(N) to iterate every substring s
  For each s
    O(1) to calculate hash(s)
    O(log(N)) to calculate total occurrences of hash value (C++ map)

Total complexity: O(N log N)

Idea

You can also solve it with hash function.

We can convert strings to base b numbers where b is a prime number.

For example, if the strings only consist of lowercase alphabet (26 characters a-z) then we can choose b equals 29.

Then we map string characters to corresponding numbers. For example:

a -> 1
b -> 2
c -> 3
...
z -> 26

String abc will equals 29^2*1 + 29^1*2 + 29^0*3 = 899 in base 29.

We should map a -> 1 but not a -> 0 since hash value of aaa and aa will be equal in base b, which shouldn't be.

Now instead of compare two strings, we can compares their hash value in base b. If their hash value are equal then we can say they are equal.

Since hash value can be very large, you can use it's module a large prime number, for example mod 1e9+7. The posibility of two different strings have same hash value is very low in this case.

Algorithm

The algorithm can be described as bellow:

Let n-letter string be S
Let hash(s) be function to get hash value of string s
For each m-letter-substring of S, call it s
   Increase the number of occurrences of hash(s), let call its o(hash(s))
Result will be the m-letter-substring s with the maximum o(hash(s))

To calculate hash(s), first we build array H where:

H[i] = (b^(i-1)*S[1] + b^(i-2)*S[2] + b^(i-3)*S[3] + ... + b^0*S[i]) % mod

Here S[i] is the mapped number of character i-th of string S.

To calculate b^x, we can calculate array powb where:

powb[0] = 1; powb[i] = (powb[i - 1] * b) % mod

Then for a substring s[l..r] of string S,

hash(s[l..r]) = (H[r] - H[l-1]*b^(r-l+1)) % mod

As we can see, hash(s) can be negative, in this case we should add mod to hash(s) (hash(s) += mod).

Complexity

O(N) to calculate H, powb
O(N) to iterate every substring s
  For each s
    O(1) to calculate hash(s)
    O(log(N)) to calculate total occurrences of hash value (C++ map)

Total complexity: O(N log N)

回复收藏 0 原文

~没有更多了~